20. (Cont'd)

Average  system  response  time  and  average  network  message  traffic  are 
computed for four management approaches: centralized, a master/slave scheme,
a  synchronized  scheme,  and  a  new  scheme  called  delayed  synchronization.  The 
new  scheme  is  based  on  daily  operation  without  synchronizing  updates,  supported 
by  nightly  merging  to  produce  identical  data  copies  throughout  the  system. 
Timeliness  information  is  associated  with  every  individual  data  item  and  users 
are  given  a  choice  in  retrieval  transactions  between  quick  response  and  most 
recently updated values.

Chapter 1

INTRODUCTION 

Database  management  systems  and  distributed  data 
processing  are  important  approaches  currently  being  developed 
for  control  of  what  we  see  today  as  explosions  in  the  amounts 
of  information  being  handled  by  computer  systems  and  in  the 
numbers  of  computer  systems  available  to  do  useful  work. 
Recently,  research  on  combining  these  two  approaches  into 
distributed  database  management  systems  (DDBMSs)  has  been 
increasing  in  an  attempt  to  produce  uniform  data  management 
policies  for  networks  of  communicating  computer  systems.  The 
goal  is  to  provide  users  with  a  general  information  resource 
without  requiring  them  to  know  specific  things  about  how  it  is 
built  in  order  to  use  it  effectively  (e.g.,  what  data  is 
located  on  which  disk  of  which  computer  system).  This 
dissertation  develops  an  evaluation  methodology  to  compare  the 
expected  cost/performance  of  distributed  database  management 
schemes  in  particular  operating  environments.  Four  specific 
examples  of  schemes  (centralized,  master/slave,  synchronized, 
and  delayed  synchronization)  are  evaluated  in  detail  and 
compared  in  terms  of  average  system  response  time  and  average 
network  message  traffic. 


The  basic  methodology  of  evaluation  begins  by  analyzing 
the  management  scheme  and  then  identifying  the  specific 
control  paths  and  the  network  data  flow  required  to  handle 
both  updates  and  retrievals.  The  control  and  data  flow 
information  is  used  to  develop  a  queuing  network  model  of  the 
entire  system.  Assumptions  about  the  operating 
characteristics  of  the  system  (such  as  communications 
connections  and  delays,  processing  power,  disk  rates,  and 
transaction  input  distribution)  are  incorporated  in  the  model 
so that average system response time and average network
message  traffic  can  be  calculated. 

Three  of  the  particular  management  schemes  evaluated  are 
specific  cases  of  standard  approaches  to  database  management: 
a  totally  centralized  scheme,  a  master/slave  scheme  wherein 
one  node  is  in  charge  of  the  distributed  system,  and  a 
synchronized  scheme  wherein  all  of  the  nodes  cooperate  to 
manage  the  database.  The  fourth  scheme,  delayed 
synchronization,  is  a  new  proposal  that  basically  consists  of 
delaying  the  synchronization  of  update  transactions  among  the 
multiple  data  copies  of  the  system.  The  intention  is  to 
improve  the  average  system  cost/performance  by  delaying  the 
resource  requirements  and  communications  costs  of 
synchronization  until  less  expensive  times.  We  will  see  that 
delayed  synchronization  is  appropriate  for  applications  having 
only  a  small  probability  of  updates  conflicting  and  for  users 
who would be willing to deal with out-of-date information if
they could get it more quickly or cheaply.

Applying  the  evaluation  methodology  to  the  particular 
management  schemes  illustrates  the  effects  that  certain 
implementation  decisions  can  have  on  system  performance.  For 
example,  use  of  a  physical  ring  to  implement  the  virtual  ring 
for  synchronized  management  imposes  sequential  communication 
without  the  possibility  of  improvement  by  broadcasting. 
However, any cost/performance predictions that are made will
need  to  be  verified  against  data  from  real  systems  as  those 
systems  are  built  and  have  their  characteristics  fed  back  into 
the  modeling  and  analysis  techniques  used.  In  the  meantime, 
this  dissertation  represents  a  first  step  toward  evaluating 
the system cost/performance of alternative distributed
database  management  schemes  and  points  out  a  specific  tradeoff 
in  information  timeliness  that  can  be  used  to  improve 
performance  in  synchronized  systems. 

The  following  presentation  begins  with  background  and 
motivation  for  distributed  database  systems  (Chapter  2)  and 
leads  into  the  new  proposals  for  a  timeliness  tradeoff  and  a 
delayed  synchronization  management  scheme  (Chapter  3).  The 
evaluation  methodology  is  developed  by  treating  the  four 
specific cases in detail (Chapter 4). Comparing the results
of  the  performance  analyses  leads  to  conclusions  about  the 
cost/performance  of  the  schemes  in  particular  operating 
environments (Chapter 5). Finally, the conclusions are
briefly summarized, several extensions to the work are
suggested, and some of the broader implications are introduced
(Chapter 6).


Chapter  2 


BACKGROUND 


The  purpose  of  this  chapter  is  to  provide  some  general 
background  information  for  the  reader  who  is  just  being 
introduced  to  distributed  database  management.  Some 
motivation  and  research  goals  are  presented,  along  with  a 
brief  description  of  the  general  problem  areas.  Additional 
detail  is  contained  in  Appendix  A.  The  more  knowledgeable 
reader  is  invited  to  proceed  directly  to  the  next  chapter. 

2.1  DISTRIBUTED  DATABASE  MANAGEMENT 

Distributed  databases  are  often  categorized  according  to 
how  the  data  are  distributed  among  the  nodes  of  the  system. 
At  one  extreme  there  is  the  partitioned  database,  which  means 
the  data  are  divided  up  among  the  nodes  without  any 
duplication  at  all.  At  the  other  extreme  is  the  duplicate 
copy  database,  where  a  complete  copy  of  the  entire  database  is 
located  at  each  node.  Between  these  two  extremes  is  a  huge 
variety  of  schemes  where  copies  of  some  portions  or  all  of  the 
database  are  located  at  some  or  all  of  the  nodes.  In  order  to 
begin  to  realize  the  potential  that  distributed  database 
systems offer for improved availability and reliability, we
will not consider the partitioned database. We will refer to
the rest of the possibilities generically as multiple-copy
databases, and only call out the extreme case of complete
duplicate copies at each node when the distinction is
important.

Once  the  database  is  distributed,  we  can  consider  whether 
or  not,  or  how,  to  distribute  the  management  system  which 
controls  it.  The  two  kinds  of  approaches  that  have  been 
proposed  are: 

• the master/slave approach: a single, distinguished
master node is in charge of managing the database and
directs whatever management activity is to be done at the
other, slave nodes; and

•  the  synchronized  approach:  all  of  the  nodes  cooperate  on 
an  equal  level  of  responsibility. 

We can see immediately that both of these approaches will
require many messages to be sent, either between master and
slaves or among the peers, to accomplish the coordinated
management. If the nodes are physically very far apart, the
delay due to the communications may become quite noticeable to
the user. In addition, we quickly suspect that the master
node may be a bottleneck due to update contention for
resources and that the overhead involved in coordinating
activity among peers may be considerable.

2.2 MOTIVATION

The  general  goal  of  the  research  on  distributed  database 
systems  is  a  combination  from  the  database  management  and 
distributed  processing  areas: 

•  to  manage  the  entire  information  resource  of  the 
distributed  system  as  a  whole  in  a  uniform  and  integrated 
fashion  (i.e.,  the  database  management  goal),  and 

•  to  do  it  in  a  way  that  takes  particular  advantage  of  the 
multiple  instances  of  the  various  hardware  and  software 
resources  that  occur  when  individual  computer  systems  are 
linked  together  by  a  communications  network  (i.e.,  the 
distributed  processing  goal). 

This  goal  is  sometimes  expressed  more  specifically  from  a 
user  viewpoint  in  terms  of  physical  data  independence, 
availability,  and  reliability.  By  physical  data  independence, 
we mean that a user of the distributed database system does
not need to know anything about specific physical locations
(e.g., what disk on which node) or storage layout schemes for
the data of interest to him. A DDBMS would be responsible for
translating the user's logical request into the ultimate
physical storage commands required to access the data, and
thus provide the data independence being sought.

By  availability  we  mean  that  the  system  has  the  resources 
available  to  do  the  user's  job  right  now:  the  necessary  data 
is accessible and there is a processor (or even several
communicating ones) ready to do the work. A DDBMS has the
potential to offer improved availability by providing multiple
copies of data resources so that contention over sharing
single copies can be reduced and by providing multiple
processing resources so that a total system workload might be
divided into pieces which could be independently or
cooperatively processed in parallel.

By  reliability  we  mean  that  the  system  as  a  whole  can 
continue  to  operate  even  if  some  of  its  individual  components 
should  fail,  requiring  repair  and  then  reincorporation  into 
the  system.  A  DDBMS  may  provide  higher  reliability  if 
multiple  copies  of  data  resources  can  be  used  as  on-line 
back-ups  in  case  one  copy  fails  and  if  multiple  processing 
sites  can  be  used  as  on-line  back-ups  in  case  a  node  fails. 
Such  a  back-up  capability  obviously  depends  on  some 
flexibility in routing communication and user requests around
crashed nodes and on whether duplicate data can be made
accessible to back-up nodes (such as by remote requests if
complete duplicate copies of the database are not kept at each
node).

2.3 SOME PROBLEM AREAS

The  research  that  has  been  done  on  distributed  database 
systems  has  identified  the  following  general  problem  areas 
(which are described in more detail in Appendix A):

• integrity: correctness of individual data items in the
context of the whole database as a model representing
some enterprise, where a data item is the smallest
independently accessible unit of data;

• organization: location and arrangement of data and
directories to facilitate efficient responses to user
requests;

• security: protection of data from accidental or malicious
corruption and prevention of unauthorized access;

• data incompatibility: limitations on data usage because
of its structure or representation;

• reliability: availability, failure recovery, explicit
indications of database status with regard to individual
operations, and prevention or reduction of situations
where data are inaccessible;

• implementation experience: what is required to build
systems for actual use; and

• cost/performance: how effectively the system responds to
the users at what cost.

The  state  of  the  art  in  distributed  database  systems  at 
the  time  that  the  research  for  this  dissertation  began  can  be 
briefly  summarized  as  follows: 

•  In  general,  data  incompatibility  was  being  ignored,  often 
by  consideration  specifically  of  networks  of  homogeneous 
processing  nodes. 


• All of the other areas except security and
cost/performance had been or were being addressed at
various levels of detail.

The research tended to focus on integrity, mostly in terms of
concurrent access control: what was required to manage the
database so that it operated correctly in the sense that the
data would be consistent and that the users would get the
results that could be expected by some universal, omniscient
observer who could see, in a single frame of time reference,
all of the activity that was relevant to the database system.
The mechanisms being proposed were complex in detail and it
was difficult to establish any common ground among them to
serve as some basis for comparisons.

The variety and complexity of the proposed management
mechanisms and the lack of published comparisons for
cost/performance attracted this author's attention and
provided the direction for this research. The question of
particular interest was: is there a specific management
alternative whose improved cost/performance can be
demonstrated in an appropriate environment?

2.4 DEVELOPMENT OF THE PROBLEM

One  way  to  develop  a  management  alternative  is  to  start 
from  a  different  point  of  view  about  what  the  rules  of 
operation  ought  to  be.  Database  systems  have  always  been  set 
up to provide users with data values which represent the most
recent information available when the retrieval request was
made or received. In frequently updated multiple-copy
distributed database systems, however, it will cost both
processing overhead and communication delay to handle the
complexity of keeping multiple copies synchronized. If we
could reduce the processing and communication requirements by
not keeping all of the copies completely up-to-date, we could
improve the response time and reduce the amount of message
traffic. This alternative would be appropriate if the
database users would be willing to accept less recent
information that could be gotten quickly and/or cheaply, and
would be willing to wait for the (probably) more expensive and
time-consuming retrieval of the most recent data when it was
necessary.

The  assumptions  on  which  such  an  alternative  approach 
would  be  based  are  that: 

•  access  to  local  data  is  quicker  than  access  to  remotely 
stored  data,  and 

•  access  to  local  data  is  cheaper  than  access  to  remotely 
stored  data. 

The  first  assumption  is  true  for  distributed  systems  which  are 
made up of nodes geographically spread out and connected by
"thin-wire" communications networks, like the ARPAnet
[DAVI73], that have limited bandwidth and provide
communications between nodes that are slow relative to
communications within a node. The second assumption is true
for the kinds of network pricing policies which are prevalent
today, such as for packet-switching nets like Telenet
[SOLO73].

To summarize briefly, the specific context and potential
payoffs for the research of this dissertation consist of:

•  multiple  copies  of  data  for  improved  availability  and 
reliability, 

•  multiple  sites  of  control  for  improved  availability  and 
reliability,  and 

•  geographic  separation  of  nodes  for  improved  system 
cost/performance through reduced contention,
communications  cost,  and  delay. 

Within this context, the particular goals are to develop a
management alternative and demonstrate its improved
cost/performance.

Chapter 3

AN  ALTERNATIVE  MANAGEMENT  APPROACH 

3.1 THE GENERAL SETTING

In  order  to  be  specific  about  the  proposals  of  this 
dissertation,  we  will  first  look  at  two  potential  examples  to 
give  us  the  context  within  which  to  describe  the  new 
alternative  approach. 

3.1.1  Airline  Reservation  System 

An  airline  reservation  system  handles  flight  scheduling, 
passenger  check-in,  and  airline  management  as  well  as  seat 
reservations.  The  basic  airline  reservation  system  in  use 
today  is  a  large  centralized  database  system.  Redundant 
hardware  at  the  central  site  ensures  minimal  possibility  of 
the  system  being  down  due  to  failures.  Terminals  from  all 
over  the  country  access  the  central  database,  usually  through 
a  regional  concentrator.  Terminal,  or  concentrator,  service 
is  provided  by  polling  from  the  central  site.  Special 
protocols  have  been  defined  so  that  various  networks  can  be 
hooked  together  for  wider  applicability.  For  example,  the 
SITA (Societe Internationale de Telecommunications
Aeronautiques) network was established (1949) as a world-wide
low-speed message switching network [DAVI73]. When
reevaluated in 1964, SITA was reorganized by establishing
high-level network centers to join areas of smaller message
concentration, so that a network of star networks was formed.
New York, London, Paris, Amsterdam, Brussels, Rome, Frankfurt,
and Madrid were chosen as high-level centers. The high-level
network (completed in 1970) was designed for and has achieved
faster, cheaper communications service.

A  possible  distributed  database  system  for  airline 
reservations would have one file per flight with an identifier
consisting of the flight number and date. Copies of the file
would be located at regional reservation centers (for the
U.S., probably about six of them: north- and south-east,
central, and west) instead of just having regional
concentrators.  One  goal  would  be  to  minimize  response  time 
and  communications  cost  between  numerous  inquiry  terminals  and 
the  distributed  database  (DDB).  Some  time  after  successful 
flight  completion,  the  file  would  be  archived  and  the  copies 
deleted  from  on-line  storage. 

Other  types  of  reservation  systems,  such  as  for  hotel 
space,  theater  tickets  or  sports  programs  (with  wide  appeal 
such  as  the  World  Series),  could  be  set  up  similarly  to  the 
airline  system.  Minor  variations  in  policy  will  have  to  be 
checked for impact on the DDBMS. Over-booking an airline
flight may be common practice, for example, while selling the
same seats twice for a play is not tolerated.

3.1.2  Equipment  Supply  System 

An  equipment  supply  system  handles  inventory  control, 
order  entry,  scheduling  of  deliveries,  and  aids  company 
management. Existing systems tend to fall into two
categories: completely centralized or master/slave copies
with deferred nightly updating. For example, a completely
centralized  system  is  used  by  J.I.  Case  [IBM  77],  manufacturer 
of  agricultural  and  construction  vehicles,  to  manage  its  spare 
parts  inventory  and  distribution.  The  system  consists  of  a 
central  site  and  85  terminals  in  assorted  warehouse  (one  main 
and  ten  regional)  and  office  (corporate,  distributors, 
dealers)  locations.  Order  entry,  order  inquiry,  purchasing, 
receiving,  depot  replenishment,  warehouse  control,  and 
management  forecasting  are  all  handled  on-line. 

An  example  using  nightly  synchronization  and  updating  of 
duplicate  information  is  the  hierarchical  network  used  by 
Celanese  to  handle  its  inventory  and  product  distribution 
[WATK77]. Two large central computers divide up the
responsibility of being master by partitioning control of the
data: one handles inventory control and bills of lading, and
the other handles invoicing and accounts receivable. These
two are both connected to a smaller computer which controls an
automated warehouse and manages a set of small production
control  computers.  Updates  which  have  to  be  distributed  to, 
or  synchronized  among,  various  locations  in  the  network  are 
batched  for  nightly  shipment  both  up  and  down  the  hierarchy; 
e.g.,  after  all  data  are  updated  and  synchronized,  one  of  the 
central  computers  will  prepare  a  list  of  everything  to  be 
shipped  on  a  particular  day,  and  then  send  it  to  the  warehouse 
control  computer  for  coordination,  scheduling  and  handling. 

The more general distributed system being suggested here
would  be  appropriate  to  a  large  company,  for  which  the  country 
would  be  divided  into  regions,  each  with  a  regional 
headquarters  and  sales  force.  Warehouses  would  be  responsible 
for  their  own  local  inventory  databases,  with  copies  located 
at  the  appropriate  regional  headquarters.  The  regional 
headquarters  would  also  maintain  sales,  customer,  and  delivery 
information  for  the  region,  and  the  sales  force  would  require 
copies  of  inventory  located  in  adjacent  regions. 

There  are  numerous  variations  possible  on  a  system  of 
this  basic  type.  Some  other  applications  would  include 
inventory  control,  accounting,  and  facilities  management  for: 
supermarket  chains,  large  multiple-plant  manufacturing 
operations,  department  stores,  multi-branch  libraries,  and 
utilities  such  as  electric  or  telephone  companies. 


3.1.3 Cost/Performance Concerns

In considering the examples from a cost/performance
viewpoint, we immediately notice two things:

• numerous messages will be required to keep all of the
data copies consistent (so that users accessing different
copies concurrently do not get different results, for
example), and

• the delays involved in both the message transmission and
the processing required to coordinate the activity among
the nodes may be significant to the user.

"Thin-wire"  communications  among  nodes,  i.e.,  limited 
bandwidth  that  is  slow  relative  to  communications  within  a 
node,  is  of  particular  interest  because  it  is  a  common 
characteristic  of  many  of  the  networks  existing  today.  Even 
as  special  leased  telephone  lines  and  satellite  systems  begin 
to  offer  alternatives  to  old  speed  and  bandwidth  restrictions, 
their  cost  will  continue  to  make  them  impractical  for  many 
companies  and  applications,  at  least  over  the  next  five  years 
[MAND78]. 

The  major  database  communications  bottleneck  within  a 
node  is  usually  between  memory  and  disk  storage.  It  is 
described by two parameters: set-up or access time and
sending or transfer rate. The large disks suitable for
database storage typically provide average access times on the
order of 20 msec and transfer rates on the order of 1 microsec
per byte. In situations where disk utilization gets high
enough, contention may be manifested in long waits just to
initiate access.
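
As a rough illustration of how contention lengthens the wait to
initiate access, the sketch below treats the disk as a single-server
queue with exponentially distributed service times (an M/M/1 model).
The choice of that model, the function name mm1_mean_wait, and the
sample utilizations are assumptions made only for illustration; the
text above supplies just the 20 msec access time.

```python
def mm1_mean_wait(service_time_s: float, utilization: float) -> float:
    """Mean time a request waits in queue before service begins (M/M/1).

    Assumes Poisson arrivals and exponential service times, which is a
    simplification chosen for this sketch, not a claim about real disks.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s * utilization / (1.0 - utilization)


ACCESS_TIME_S = 0.020  # 20 msec average access time, from the text

# Waiting time at a moderately busy and at a heavily loaded disk.
wait_at_half = mm1_mean_wait(ACCESS_TIME_S, 0.5)
wait_at_ninety = mm1_mean_wait(ACCESS_TIME_S, 0.9)
```

Under these assumptions a disk that is 50% utilized adds about one
extra access time of waiting, while 90% utilization adds about nine.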

Communications  among  network  nodes  can  also  be 
characterized  by  access  time  and  transfer  rate,  but  now 
transit  times  also  become  appreciable.  Current  long-range 
communication  (b  -  bO  Kilobaud)  provides  transfer  rates  of 
roughly  .2  -  2  msec  per  byte,  and  transit  times  (e.g.,  across 
the  U.S.)  on  the  order  of  10  msec  per  byte.  Remote  access 
time  is  defined  as  the  interval  between  initiation  of  the  call 
for  the  remote  data  and  transfer  of  the  first  byte.  Typical 
times  are  10  to  100  times  slower  than  local  disk  accessing 
(for  example,  ARPAnet  times  to  establish  a  virtual  circuit). 
This  gives  a  rough  estimate  for  remote  access  time  that  is  on 
the  order  of  a  second.  Computer  Corporation  of  America  is  in 
the  process  of  designing  a  distributed  database  system  that  is 
based  on  a  similar  assumption  that  access  to  the  first  byte  of 
data  at  a  remote  node  will  be  bO  to  100  times  slower  than 
access to data on a local disk [WONG77].
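
The local and remote figures quoted above can be compared with a small
calculation. The sketch below is illustrative only: it models a
retrieval as access time plus a per-byte transfer cost, folds transit
time into the remote access time, and uses a hypothetical 1000-byte
record. The function name and the record size are assumptions; the
parameter values are the order-of-magnitude figures given in the text.

```python
def retrieval_time(access_s: float, per_byte_s: float, n_bytes: int) -> float:
    """Time to fetch n_bytes: a fixed access time plus per-byte transfer."""
    return access_s + per_byte_s * n_bytes


# Order-of-magnitude parameters taken from the text.
LOCAL_ACCESS_S = 20e-3            # about 20 msec disk access
LOCAL_PER_BYTE_S = 1e-6           # about 1 microsec per byte
REMOTE_ACCESS_S = 1.0             # remote access on the order of a second
REMOTE_PER_BYTE_FAST_S = 0.2e-3   # about .2 msec per byte
REMOTE_PER_BYTE_SLOW_S = 2e-3     # about 2 msec per byte

RECORD_BYTES = 1000  # hypothetical record size, not from the text

local = retrieval_time(LOCAL_ACCESS_S, LOCAL_PER_BYTE_S, RECORD_BYTES)
remote_fast = retrieval_time(REMOTE_ACCESS_S, REMOTE_PER_BYTE_FAST_S, RECORD_BYTES)
remote_slow = retrieval_time(REMOTE_ACCESS_S, REMOTE_PER_BYTE_SLOW_S, RECORD_BYTES)
```

With these numbers the local retrieval takes roughly 0.02 seconds and
the remote one between about 1.2 and 3 seconds, which is the gap that
motivates preferring local copies when slightly older data will do.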

We  can  summarize  our  concern  over  communications  and 
coordination  costs  and  delays  with  two  quantities:  the  number 
of  network  messages  to  complete  a  user  transaction  represents 
the  cost  aspect  and  the  response  time  which  the  user  sees 
(between  completing  his  request  and  receiving  his  reply) 
represents the performance aspect.

3.2 TRADEOFF IN TIMELINESS

3.2.1 Definitions

Let us assume that for certain applications distributed
data copies would not have to be immediately synchronized, and
propose that therefore the users of the database would get
quicker response time and the communications costs in the
system could be lower. The improvement in response time would
be based on the premise that users would be willing to work
with old data (out of date with respect to the database as a
whole) if they could get it quickly. Correspondingly, the
lower communications cost would be based on the premise that
periodic batched transmission of updates would be cheaper than
handling each one as it occurs. Both of these premises are
supported by experience with existing centralized database
systems such as the Retail Store System by IBM [MCEN75], where
updates are collected throughout the day at individual stores
and then batched to company headquarters at night. Similarly,
Celanese [WATK77] does their synchronization daily rather than
sending current updates through the network during the day.
If absolutely up-to-date information is required, it must be
handled by making a telephone call to a person in the
appropriate warehouse.

We will explicitly define "data timeliness" as how recent
(or how old) the information is. By associating a timestamp
with every data item as it is entered into the database
(insertions and updates), we will be able to quantify the data
timeliness as the difference between current time and the item
timestamp. In this way, data timeliness is an intrinsic
attribute of a data item which depends on the context from
which it is referenced. We will also define "access
timeliness" to indicate how quickly the information is needed
in order to be useful to the one who requested it. The user
will subjectively define some time interval within which he
needs the information in order for the access to be timely.
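
The two definitions can be made concrete with a minimal sketch. The
names DataItem, data_timeliness, and access_is_timely are hypothetical,
and timestamps are assumed to be plain numbers of seconds read from a
common clock; neither the representation nor the units are specified
in the text.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    value: object
    timestamp: float  # time of the insertion or update that wrote this value


def data_timeliness(item: DataItem, now: float) -> float:
    """Age of the item: current time minus the item timestamp."""
    return now - item.timestamp


def access_is_timely(response_time: float, needed_within: float) -> bool:
    """An access is timely if it completes within the user's chosen interval."""
    return response_time <= needed_within


seat_count = DataItem(value=42, timestamp=1000.0)
age = data_timeliness(seat_count, now=1060.0)  # item was written 60 s ago
```

The same item has a different data timeliness each time it is
referenced, since the value depends on the current time at the moment
of the request.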


3.2.2 QR-BI Retrievals


The  tradeoff  between  data  and  access  timeliness  is  made 
available  to  the  user  as  two  types  of  retrievals: 

• quick response (QR) retrievals for the user who values
access  timeliness  more,  and 

•  best  information  (BI)  retrievals  for  the  user  who 
requires  the  best  data  timeliness. 

Interactive  users,  for  example,  who  want  quick  response  for 
retrievals,  and  are  willing  to  deal  with  data  values  that  may 
not  reflect  the  most  recent  update  cycles,  are  served  quickly. 
In  contrast,  batched  programs  which  can  easily  (or 
transparently,  at  least)  afford  to  wait  for  the  most  recent 
data values to be found get the best information possible. QR
users will be given the additional option (if included at
database design time) of determining the timeliness of their
retrieved data by looking at the associated timestamps.

The operating difference required by the users accessing
the distributed database is minimal; instead of having simply
a RETRIEVE operation, there would be (at least) two options:
RETRIEVE QUICK (or RETRIEVE QR) and RETRIEVE BEST (or RETRIEVE
BI). An intermediate option of RETRIEVE BEST WITHIN (time
limit) would also be possible, but could be more complicated
to handle: first a definition of what "best" means within a
time limit would be needed, and then a strategy for how to
determine or calculate it would have to be set up. Standard
defaults could be arranged, so that, for example, interactive
users would get QR unless they specified BI, and batch users
would get BI unless QR was explicitly requested.

To deal with the retrieval options most productively, a
user must understand something about the possible responses to
his requests. A QR request may return the best information if
the local copy happens to be the last one updated. A BI
request may not return the absolute best information, in the
context of the distributed database as a whole, if some node
is down or too busy to reply; the best information available
when the request is processed (that is, the relative best)
will be what is returned. On the other hand, the intermediate
option of best within a time limit may produce no result at
all if the data are not local and the limit is reached too
quickly for a remote result to have arrived. The user could
then be given a choice of removing the limit or restating his
request (e.g., continue waiting for the result to arrive or
cancel the request by throwing away the result when it does
arrive). For simplicity, this dissertation will deal only
with the extreme options, QR and BI.
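
The behavior of the two extreme options can be sketched as follows.
This is an illustration under stated assumptions, not the mechanism
developed later in the dissertation: each copy is modeled as a
dictionary from key to (timestamp, value), a node that is down or too
busy is modeled as None, and the names retrieve_quick and
retrieve_best are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Stamped = Tuple[float, str]            # (timestamp, value)
Node = Optional[Dict[str, Stamped]]    # None models a node that cannot reply


def retrieve_quick(local: Dict[str, Stamped], key: str) -> Optional[Stamped]:
    """QR: answer from the local copy only, whatever its age."""
    return local.get(key)


def retrieve_best(local: Dict[str, Stamped], remotes: List[Node],
                  key: str) -> Optional[Stamped]:
    """BI: return the most recently stamped value among the reachable copies.

    If some node cannot reply, the result is only the relative best.
    """
    best = local.get(key)
    for node in remotes:
        if node is None:  # node down or too busy: skip it
            continue
        candidate = node.get(key)
        if candidate is not None and (best is None or candidate[0] > best[0]):
            best = candidate
    return best


local_copy = {"flight_101": (100.0, "40 seats booked")}
remote_copy = {"flight_101": (150.0, "42 seats booked")}

qr = retrieve_quick(local_copy, "flight_101")
bi = retrieve_best(local_copy, [remote_copy, None], "flight_101")
bi_degraded = retrieve_best(local_copy, [None, None], "flight_101")
```

Here the QR request returns the older local value, the BI request
returns the newer remote value, and a BI request issued while every
remote node is unreachable falls back to the local value as the
relative best.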

It  is  important  to  remember  that  this  tradeoff  between  QR 
and  BI  is  a  retrieval  option  and  does  not  affect  the  updating 
policy  except  from  a  performance  standpoint.  If  it  is 
important  to  the  users  of  a  particular  application  to  know 
with  complete  certainty  that  their  updates  have  been  properly 
applied,  the  database  management  system  will  have  to  do 
whatever  locking,  coordination,  and  verification  is  required. 
The  only  problem  is  that  in  geographically  dispersed  systems, 
such  activity  may  involve  many  resources  and  require 
appreciable  amounts  of  time. 

Some  simple  examples  of  the  differences  between  QR  and  BI 
requests  may  help  to  motivate  the  value  of  the  tradeoff 
involved.  In  an  airline  reservation  system  a  manager  might 
want  to  know  approximately  how  many  seats  were  already  booked 
on  a  particular  flight  for  next  month.  A  QR  retrieval  would 
be appropriate for the "approximately" and the manager would
receive his answer quickly from local data. If Joe Green
wanted to know whether his secretary had properly booked him
on that flight, a BI retrieval would get him the answer with
certainty. Similarly, a regional salesman asking for general
information purposes how many units of a particular type were
available in the warehouse closest to him would use QR and
generate as little database traffic and contention as
possible. The circumstances surrounding an individual request
can have a very definite effect on which type of retrieval is
appropriate. For example, a question about whether a
particular order had been filled and actually shipped could be
handled by QR and a timestamp if the salesman is checking for
his own information, but had better be retrieved as BI when
the customer is waiting on the telephone to find out.


3.2.3 Data Quality

Since we expect retrieval situations where the number of
QRs far outweighs the number of BIs to give us a significant
cost/performance improvement for the distributed database
system as a whole, we would like to provide even more
incentive for the use of QR retrievals. The timestamps
associated with each data item can sometimes be used to
establish the appropriateness of a particular data value
retrieved with QR by defining its timeliness. Suppose we take
a broader view and consider something we will call data
quality. This would be an indication of what changes had
taken place since the local copy of the data item had last
been updated. Data quality would be most appropriate, then,
when associated with a fairly large unit of data, so that the
overhead  of  its  storage  would  remain  relatively  small.  This 
is  in  contrast  with  the  timeliness  approach  of  attaching  a 
timestamp  to  the  smallest  unit  of  data  which  can  be  updated. 
The  issue  of  granularity,  i.e.,  the  question  of  what  size 
unit,  is  similar  in  both  cases.  The  maximum  information  is 
gained  with  the  smallest  unit  association,  but  at  maximum  cost 
in  storage.  The  timeliness  approach  assumes  a  willingness  to 
pay  the  storage  cost  for  the  timestamps;  the  granularity  for 
the  quality  indicator  can  be  considered  separately,  and 
depends  specifically  on  the  type  of  data  and  type  of 
application  involved. 

In  the  distributed  database  context,  when  a  user  closes 
or  releases  a  file  (i.e.,  the  subset  of  the  database  being 
worked  with)  that  he  has  updated,  he  could  be  asked  by  the 
DDBMS  for  an  evaluation  of  the  importance  of  the  changes  to 
the  file  during  that  session.  The  DDBMS  will  add  a  timestamp 
to  that  evaluation  to  create  the  "quality"  information  to  be 
associated  with  that  update  session  on  that  file.  Such  a 
quality  indicator  could  be  sent  by  the  DDBMS  to  any  copies  of 
the  file  which  are  not  being  synchronously  or  immediately 
updated.

The value of data quality indicators to users of the data
other than the updating one depends on a number of
considerations such as: the trustworthiness of the one
assigning the evaluation, the meaningfulness of the timestamp
outside its frame of reference (if node clocks are not
synchronized in some way, they represent completely
independent frames of time reference), the degree of agreement
among the user community on what's important, and even the
relationship between what constitutes a revision and the
number of update sessions required to produce that revision.
These problems emphasize the difficulties inherent in the
paradoxical attempt to quantify quality.

3.2.3.1 Airline Reservation Example.

A simple way to use data quality indicators in the
airline reservation system would be to associate with each
flight file a count of how many seats remain unreserved
throughout the network. Then, even if the local file copy did
not have all of the information regarding each reservation
that had been made, new reservations could be made with less
chance of overbooking the flight. The detail of such data
quality use requires more information about how the system
operates (see section 3.3.2.2).

3.2.3.2 Equipment Supply Example.

A similar use of data quality would be appropriate in an
equipment supply system. If local copies of inventory files
for warehouses not in the local region were not kept
immediately up to date, the data quality indicator could be
used just to signal that update activity had occurred on a
particular item. For more detail, see section 3.3.2.3.

3.2.3.3 Text Example.

A  very  detailed  example  of  the  use  of  data  quality  which 
can  be  developed  with  no  additional  operational  rules  is  for 
text  editing,  where  updating  activity  usually  produces 
distinct  versions  of  the  text  file.  I  might  give  you 
yesterday's  version  of  the  text  to  read,  telling  you  that  even 
though  it  is  not  a  copy  of  the  most  recent  version,  the 
changes  that  were  made  were  all  minor  editing  changes  which 
had  little  or  no  effect  on  the  text  content  and  meaning.  You 
would  probably  be  willing  to  read  and  comment  on  that  old 
version,  with  a  certain  amount  of  confidence  that  your  copy  is 
a  reasonable  representation  of  the  actual,  current  text. 

A data quality indicator (DQI) for text files can be
defined as a timestamp (TS) and a severity-of-change (SOC)
code: DQI = (TS, SOC), where the SOC code indicates how many
changes were made and how important they were considered by
the user who made them. One way to implement the code as
three digits is shown in Table 3-1. Let us consider, for
example, an author editing a textbook where the data unit is
the whole book. For rewriting an entire chapter, he might
assign a SOC code of 133 to mean one large chunk was changed
significantly; for a section, 122 or 123 to mean one medium
chunk changed a medium or major amount; and the timestamps
would indicate when the last update of the revision was made.
For changing all words "which" to "that", the SOC code
probably would be 311. The particular advantages to this
scheme are its simplicity and its brevity (only four or six
bits are needed to hold the code).

It  is  not  likely  that  this  type  of  SOC  code  can  be 
generated  automatically  by  the  DDBMS  during  an  update  session. 
Changes  could  be  counted,  but  chunk  size  and  importance  would 
be  too  difficult  to  keep  track  of  and  to  evaluate.  Despite 
such  philosophical  considerations,  we  can  see  that  a  data 
quality  indicator  would  be  of  particular  use  for  any  files 
that  are  not  updated  immediately  in  a  distributed  database. 
The  DQI  would  be  transmitted  at  the  end  of  each  update  session 
to  the  remote  copy  sites  and  added  to  the  file  header  as  part 
of  a  change  history  until  the  copy  actually  receives  the 
update  set  summarized  by  the  DQI.  There  will  be  a  definite 
tradeoff  to  be  examined  between  the  length  of  the  history  in 
the  file  header  and  the  delay  period  of  the  updating.  In 
order  to  be  certain  when  the  DQI  could  be  erased,  its 
timestamp  should  be  that  of  the  concluding  update  transaction. 
When  the  copy  site  later  receives  a  set  of  updates  with 
timestamps  up  to  and  including  that  of  the  DQI,  the  DDBMS  will 
know  which  DQI  is  no  longer  needed. 
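The bookkeeping for the change history can be sketched as follows. This is only an illustration under stated assumptions: the class names DQI and FileHeader are hypothetical, and a timestamp is simplified to a single integer, whereas the text leaves the timestamp format open at this point.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DQI:
    timestamp: int  # timestamp of the concluding update transaction
    soc: int        # three-digit severity-of-change code, e.g. 133


@dataclass
class FileHeader:
    history: List[DQI] = field(default_factory=list)

    def record(self, dqi: DQI) -> None:
        """Append a DQI received at the end of a remote update session."""
        self.history.append(dqi)

    def updates_applied(self, latest_update_ts: int) -> None:
        """Erase every DQI whose update set has now been received."""
        self.history = [d for d in self.history
                        if d.timestamp > latest_update_ts]
```

The length of `history` is exactly the quantity traded off against the delay period: the longer updates are withheld, the more DQIs accumulate in the header.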
Table 3-1. Severity-of-change

SOC = ( CNO, CSZ, IMP ), where

    CNO: number of data chunks changed,
    CSZ: size of data chunks changed,
    IMP: importance of changes.

    digit value    CNO      CSZ      IMP

        1          one      small    minor
        2          a few    medium   medium
        3          many     large    major

    A value of 0 for any digit means it was not specified.
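As an illustration of Table 3-1, the sketch below decodes a three-digit SOC code and packs it into six bits. The function names are hypothetical, and the two-bits-per-digit layout is an assumption chosen because each digit takes one of four values (0 through 3); the text does not prescribe a bit layout.

```python
CNO = {0: "not specified", 1: "one", 2: "a few", 3: "many"}
CSZ = {0: "not specified", 1: "small", 2: "medium", 3: "large"}
IMP = {0: "not specified", 1: "minor", 2: "medium", 3: "major"}


def split_soc(code: int) -> tuple:
    """Split a three-digit SOC code into its (CNO, CSZ, IMP) digits."""
    digits = (code // 100, (code // 10) % 10, code % 10)
    if code < 0 or max(digits) > 3:
        raise ValueError("each SOC digit must be in 0..3")
    return digits


def describe_soc(code: int) -> tuple:
    """Translate a SOC code into the wording of Table 3-1."""
    cno, csz, imp = split_soc(code)
    return CNO[cno], CSZ[csz], IMP[imp]


def pack_soc(code: int) -> int:
    """Pack the three digits into six bits, two bits per digit."""
    cno, csz, imp = split_soc(code)
    return (cno << 4) | (csz << 2) | imp
```

With these definitions, the chapter rewrite coded as 133 decodes to one large chunk with a major change, and the global "which" to "that" edit coded as 311 decodes to many small chunks with minor changes.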


3.3  DELAYED  SYNCHRONIZATION  MANAGEMENT 

3.3.1  A  New  Management  Approach 

In  addition  to  the  timeliness  tradeoff,  we  will  need  the 
following  assumptions  in  order  to  proceed  with  the  proposal  of 
an  alternative  management  scheme  that  delays  synchronization 
in  a  frequently  updated,  multiple-copy  distributed  database: 

•  the  cost  for  a  single  batch  of  N  messages  is  less  than 
the  total  cost  for  N  individual  messages,  and 
•  data  copies  can  be  allowed  to  diverge  (i.e.,  become 
inconsistent)  because  BI  retrievals  can  always  find  the 
most  recent  value  for  any  data  item  whenever  all  the 
network  nodes  are  up. 

Delayed  synchronization  management  consists  of  handling 
updates  immediately  at  the  site  of  origin  and  not 
synchronizing  them  with  any  other  data  copies  in  the 
distributed  system  until  after  some  delay  period.  The  length 
of  the  delay  will  determine  how  much  the  copies  diverge 
according to how many updates come in and to what sites. The
specific objective is to improve the average system
cost/performance by delaying some of the expensive and
time-consuming resource requirements (to synchronize the
updates among all data copies) until less busy, less expensive
times.

3.3.2  Fundamental  Notions 

The  first  step  in  a  timeliness  approach  to  delaying  the 
synchronization  of  multiple  copies  in  a  distributed  database 
is  to  classify  the  multiple-copy  files  according  to  their 
requirements  for  update  application  and  synchronization.  This 
could  be  a  static  classification,  by  the  users  and  the 
database  administrator  at  design  time,  or  some  conditions 
could  be  specified  for  dynamic  classification.  The  classes  of 
files  are  those  for  which: 
(1) updates are applicable at all copies as soon as
possible and must be synchronized, called "synch" files;

(2) updates are applicable to all copies but need not be
immediately synchronized, called "delsync" files; and

(3) updates are applicable to only one primary copy and
are subsequently indicated (e.g., broadcast) to all
back-up copies, called "master/slave" files.

Files  of  the  latter  two  types,  delsync  and  master/slave,  will 
allow  several  update  policy  options  such  as  immediate, 
delayed,  or  periodic  transmission  of  updates,  with  possible 
intermediate  transmission  of  some  sort  of  data  quality 
indication.  All  updates  will  eventually  be  done  to  all  copies 
in  order  to  provide  the  long-term  mutual  consistency,  with  the 
delay  period  depending  on  the  particular  application.  Notice 
that  the  word  "file"  here  (and  throughout  this  dissertation) 
refers  to  some  identifiable,  replicated  subset  of  the  whole 
database  (such  as  a  single  relation  or  a  Codasyl  area).  It 
does  not  necessarily  imply  some  particular  storage  or  header 
structure,  or  even  a  data  model  scheme  (hierarchical, 
relational,  network). 

The  second  step  of  a  timeliness  approach  is  to  classify 
the  DDB  uses  according  to  whether  data  timeliness  or  access 
timeliness  is  more  important.  That  is,  divide  the  uses  or 
users  into  groups  requiring  or  desiring  quickest  response  as 
opposed  to  best  information.  This  phase  could  be  handled  in 
several  different  ways,  including: 
(1) preclassify all known transaction types at design
time (this is done in SDD-1, see section A.7.2 of the
appendices);

(2) preclassify users at system entry time as to whether
they want quick response or will wait for latest
information;

(3)  keep  user  profiles  and  classify  users  according  to 
their  use  history; 

(4)  allow  dynamic  specification  of  these  extreme  options 
and  some  intermediate  options  as  well:  respond  within 
some  time  limit,  or  return  the  best  information  which  can 
be  found  within  some  time  limit. 

3.3.2.1  Ordinary  Operation. 

Delsync  files  will  be  appropriate  only  when  the  delay  in 
synchronization  will  not  jeopardize  the  long-term  consistency 
of  the  database.  Whenever  an  update  for  a  delsync  file  is 
received,  the  DDBMS  must  check  to  see  whether  application  of 
the  update  would  violate  any  consistency  constraints  (such  as 
not  overbooking  an  airline  flight).  If  constraints  would  be 
violated,  then  synchronization  must  be  initiated  and  the 
update  held  off  until  all  copies  are  brought  up  to  date  and 
the  file  is  changed  to  synchronized  mode. 

If  no  constraint  would  be  violated,  we  can  define  two 
types  of  update  policies:  immediate  and  delayed.  In  the 
immediate case, the update would be applied locally and sent
to other copies, where we again have a choice of applying the
update either immediately or after a delay. Delay local to
the update originator tends to minimize the average
communications cost and delay while risking an occasional
penalty if some user requests a BI retrieval. Delay in
applying at the copies once the update has been received may
be  used  to  keep  disk  contention  minimal  during  periods  of  high 
activity. 

For  updating  delayed  at  the  originator,  the  update  would 
be  saved  locally  after  local  application  and  sent  out  to  the 
other  copies  after  a  delay  period.  A  technique  called 
differential  files,  commonly  used  for  file  storage,  back-up 
and  recovery  [VERH78],  will  conveniently  handle  both  the 
update  saving  and  the  access  to  the  file  so  that  updated 
information  is  properly  retrieved.  If  updating  were  not  too 
frequent  or  if  the  delay  period  were  long,  an  indicator  such 
as  data  quality  could  be  sent  to  the  other  copies  to  show  that 
updates  had  occurred.  The  appropriateness  of  sending  such  an 
indication  depends  on  the  frequency  and  size  of  the  updates, 
and  on  the  granularity  of  the  data  aggregate  to  which  the  DQI 
applies.  For  small,  frequent  updates,  the  indicator  traffic 
would  be  about  the  same  as  sending  the  updates  themselves  and 
would  completely  destroy  the  advantage  of  saving  the  updates 
for  a  big  block  transmission.  Intermediate  solutions  could  be 
created  by  batching  the  update  transmissions  more  frequently, 
such  as  during  a  particular,  agreed-upon  five-minute  period 
out  of  each  hour  of  operation. 

Retrievals  from  delsync  files  would  be  answered  locally 
unless  the  information  was  not  held  locally  or  a  specific 
request  was  received  for  best  information.  For  BI,  the 
retrieval  would  have  to  be  sent  to  all  file  copies,  the 
results  collected  together,  and  the  latest  information 
selected  from  among  the  responses.  The  user  would  have 
explicitly  given  up  quick  access  in  favor  of  best  data. 

In  general,  the  approach  is  to  preclassify  the  types  of 
data  according  to  the  probability  of  conflicting  updates  and 
the  penalty  for  old  data  or  possible  errors.  For  files  which 
seem  to  qualify  as  delsync,  constraints  are  specified  to 
minimize  the  chances  of  errors  from  conflicting  updates,  and 
error  detection  is  coupled  with  correction  or  notification 
procedures.  If  the  file  mode  is  to  change  dynamically,  change 
conditions  and  procedures  must  be  established,  as  suggested 
above. 

In  spite  of  additional  complexity,  the  distributed  file 
copies  give  good  data  availability  and  quicker  response  to 
many  retrievals.  The  chief  advantage  over  a  centralized 
system  is  cheaper,  quicker  average  retrieval  since  total 
network  traffic  and  queuing  waits  are  reduced.  For 
interactive  management  queries,  the  quick  look  at  local  copies 
of  information  and  the  associated  quality  indication  is 
adequate  for  nearly  all  requests  and  again  saves  network 
traffic,  lowering  associated  communications  costs. 

3.3.2.2  Airline  Reservation  System  Example. 

A classification at design time of the files for an
airline reservation system may be: near-full flights as synch
type, near-empty flights as delsync type, and checking-in
flights as master/slave type. It is easy to see that
reserving seats on nearly full flights needs to be coordinated
among file copies so that the flight is not over-booked.
Reserving seats on near-empty flights can be done in parallel
on the multiple file copies with periodic (probably nightly)
merging of all updates into all copies. This works well until
the number of reservations exceeds some flight threshold. For
example, suppose the threshold were set at 50% full. Any
reservation which would cause the local file to exceed the
threshold would require all copies to synchronize before the
reservation could be granted. The file would then become a
synch type and all further reservations would be synchronized
among all file copies. The particular advantage to this
approach is to drastically reduce the average amount of
network traffic (by updating delsync files locally and only
merging during periods of low activity) so that quick response
from the fewer synch files can be provided and communications
cost can be reduced. Thus, updates on nearly-full flights
close to departure time can still be handled more quickly than
in centralized or fully synchronized types of systems.


We can, of course, construct pathological cases of
thresholds, block reservations, and concurrent updates of
distinct copies which will overbook the flight:

    100 seats on the flight,
    total of 12 prior reservations,
    threshold of 50,
    3 file copies,
    3 simultaneous requests for blocks of 30 seats each,

yielding allocation of 102 seats without even generating a
synchronization requirement at any file site. Thresholds will
have to be chosen through experience so as to minimize the
probability of such incorrect file operations by taking into
account both the seating capacity and the overbooking
probability or penalty. A crude threshold value, for example,
could be set by dividing the total number of seats by the
number of reservation sites; for most files this is too
conservative an approach. Another possibility is to add a
second threshold on the size of block requests, which could be
dependent on both the number of seats still available and the
number of reservation sites in the network.
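The arithmetic of the pathological case can be checked with a short sketch. The helper accepts_locally is hypothetical; it assumes that a site grants a request without synchronizing whenever its own copy would stay at or below the threshold, which is one reading of the rule given above.

```python
def accepts_locally(local_reserved: int, block: int, threshold: int) -> bool:
    """True if the local copy stays at or below the threshold."""
    return local_reserved + block <= threshold


seats, prior, threshold, sites, block = 100, 12, 50, 3, 30

# Every copy starts from the same merged state (12 seats reserved)
# and sees only its own new block request.
granted = [block for _ in range(sites)
           if accepts_locally(prior, block, threshold)]
total_reserved = prior + sum(granted)   # 12 + 3 * 30

# The crude alternative: total seats divided by number of sites.
crude_threshold = seats // sites
```

Each site sees 12 + 30 = 42 reserved seats, which is under the threshold of 50, so all three blocks are granted and the merged total is 102. Under the crude threshold of 33, the same request would have forced synchronization at every site.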

In  addition,  a  method  for  recovering  from  any  error  must 
also  be  provided.  In  the  example  here,  flight  over-booking  by 
two  seats,  the  airline  may  choose  to  leave  the  conflict  and 
count  on  subsequent  cancellations  or  no-shows  to  resolve  it. 
In any case, the DDBMS will have to provide some type of
notification (such as a terminal alert to an operator) to
establish awareness of the error situation. Typically the
error will be discovered during the nightly merge operation if
the file is still in delsync mode, or during the transition
from delsync to synch mode. In either case, the error will
appear within 24 hours. To ensure that it also shows up at
least  several  days  before  the  flight,  transition  can  be  forced 
(regardless  of  any  threshold)  some  minimum  number  of  days 
before  the  flight.  Such  forced  transition  has  the  additional 
benefit  that  no  update  will  have  to  wait  while  the  transition 
takes  place. 

A  file  will  be  changed  to  master/slave  type  when 
passengers  begin  checking  in  for  the  flight.  The  file  copy 
closest  to  the  check-in  point  becomes  the  master  and  all 
further  updates  must  go  through  the  master  for  transmission  to 
the  slave  copies.  This  will  ensure  uniqueness  of  seat 
assignments,  accuracy  of  passenger  lists,  and  proper  passing 
of  through-routing  confirmations  for  connecting  flights. 
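The life cycle of a flight file described in this section can be summarized in a small sketch. Everything here is illustrative: the function flight_file_mode and its parameters are hypothetical, and the function is stateless for brevity, whereas in the scheme described a file that has become synch stays synch until check-in begins.

```python
from enum import Enum


class FileMode(Enum):
    DELSYNC = "delsync"
    SYNCH = "synch"
    MASTER_SLAVE = "master/slave"


def flight_file_mode(reserved: int, threshold: int,
                     days_to_departure: int, forced_days: int,
                     check_in_open: bool) -> FileMode:
    """Classify a flight file from its current state."""
    if check_in_open:
        # Check-in has begun: one master copy handles all updates.
        return FileMode.MASTER_SLAVE
    if reserved > threshold or days_to_departure <= forced_days:
        # Threshold exceeded, or transition forced before the flight.
        return FileMode.SYNCH
    return FileMode.DELSYNC
```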

Classification  of  uses  in  the  distributed  reservation 
database  shows  that  reservations  for  near-empty  flights  get 
the  quickest  possible  response  from  local  file  copies,  while 
reservations  for  near-full  flights  are  properly  synchronized 
without  resource  competition  from  the  less  urgent  requests. 
Management statistics and planning, usually interactive
terminal operations, will desire quick response; as long as
archived data is not requested, retrievals can all be made from
local copies, whether synch, delsync or slave. Report
generation  requires  accurate  information,  but  as  a  batched 
operation  usually  run  in  the  background,  it  can  afford  to  wait 
for  any  information  to  be  gathered  either  from  delsync  copies 
or  from  archived  files. 

The  object  of  all  this  classification  is  to  allow  some 
flexibility  in  DDBMS  strategies  for  file  update  and  retrieval 
in  order  to  improve  average  response  time  as  much  as  possible 
and  cut  communications  costs.  Near-empty  flight  files  will  be 
handled  by  periodic  update  transmissions  among  the  copies, 
while  near-full  flight  files  will  be  synchronized  by  an 
appropriate  protocol.  Quick  response  management  queries  will 
be  answered  strictly  from  local  data  without  waiting  for 
communications  from  other  nodes,  while  report  generation  will 
wait  to  gather  the  latest  information  from  all  locations 
containing  it. 

3.3.2.3  Equipment  Supply  System  Example. 

File  classifications  for  equipment  supply  are  of  two 
types,  static  and  dynamic.  Warehoused  inventory  will  be 
preclassified and remain master/slave, as will production
(manufacturing) records. Installed inventory records (e.g.,
for rented equipment) and customer information (e.g., address)
will be delsync, as the low rate of update activity makes
conflicts improbable. Billing and accounting
files will require synchronization for accuracy and
timeliness.

Dynamic  file  classification  applies  to  equipment  orders 
and  service  requests.  Orders  being  entered  and  checked  for 
consistency  may  be  delsync  and  change  to  synch  as  they  are 
assigned  production  line  positions  and  warehoused  equipment. 
After assigned equipment is produced, as it is assembled for
shipping, the filled order may be master/slave for localized
control and assurance of proper delivery. Similarly, service
requests may be entered as delsync; assignment must be
synchronized (transition occurs nightly so work schedules can
be prepared); and completed requests can be delsync as history
records (where updates are unlikely, if even permitted).

Salesmen  entering  order  transactions  in  the  system  will 
first  request  items  from  their  local  region  inventory.  If  the 
items  are  not  available  locally,  adjacent  regions  (close 
physical  proximity)  will  be  checked.  The  warehouse  master 
inventory  files  will  update  local  regional  headquarters 
slaves,  but  may  only  transmit  change  indications  to  adjacent 
regions  until  off-peak  hours  when  update  transmission  would  be 
cheaper.  The  adjacent  region  salesman,  then,  would  have  to 
give  up  access  timeliness  and  wait  for  remote  information  if 
he  needed  the  most  recent  information  on  an  item  file  that  had 
a  change  indicated  (DQI),  or  the  indication  could  include 
information  about  how  much  or  what  type  change  had  occurred. 
Delivery  scheduling  and  billing  would  be  periodic  batch  runs 
and  would  wait  to  be  sure  their  information  was  completely  up 
to  date.  Management  terminal  transactions  would  be  serviced 
quickly  from  local  copies  of  information  unless  the  user 
specifically  requested  most  recent  data. 

3.3.3  Control  Flow 

Management  by  delayed  synchronization  can  be  summarized 
as in Figure 3-1. The language used is not any particular
one; it is just a combination of structured primitives and
brief explanations.

Figures  3-2  through  3-5  represent  the  next  level  of 
refinement  of  the  overview  control  flow  and  they  begin  to 
incorporate  some  of  the  details  of  the  management  mechanisms 
for  delayed  synchronization. 

The  object  of  Figure  3-2  is  to  make  clear  that  the 
delsync  updates  are  handled  locally,  and  account  for  the 
divergence  of  the  file  copies  from  the  last  time  they  were 
synchronized  by  the  merging  process.  One  of  the  premises  of 
delsync  would  have  to  be  a  high  degree  of  locality  of 
reference;  that  is,  that  nearly  all  of  the  retrieval  requests 
at  a  particular  node  can  be  satisfied  by  data  at  that  node. 
The few requests that will need to be forwarded to other nodes
are handled as in Figure 3-3.

Figure 3-1. Overview of Delsync Management

    DO CASE (transaction type):
        CASE (update)
            IF (no constraint violated)
                THEN delsync update
                ELSE make transition
                     synchronize the updating
            ENDIF
        ENDCASE
        CASE (retrieval)
            IF (BI requested)
                THEN get data from all copies
                     select best result
                ELSE get data (+ DQI) from
                     closest copy
            ENDIF
            process result for presentation to user
        ENDCASE
    ENDDO
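The overview of Figure 3-1 might be rendered in a conventional language roughly as follows. This is a sketch only: the function names, the log list, and the (timestamp, value) record layout are assumptions made for illustration, and the transition itself is reduced to a logged step.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[int, str]  # (timestamp, value); layout is an assumption


def handle_update(violates_constraint: Callable[[str], bool],
                  update: str, log: List[str]) -> None:
    """Update branch: apply locally unless a constraint would be violated."""
    if not violates_constraint(update):
        log.append("delsync update: " + update)
    else:
        log.append("make transition")
        log.append("synchronized update: " + update)


def handle_retrieval(copies: Dict[str, Record], local_site: str,
                     best_information: bool) -> Record:
    """Retrieval branch: BI consults every copy, QR only the local one."""
    if best_information:
        return max(copies.values(), key=lambda rec: rec[0])
    return copies[local_site]
```

A QR retrieval at site A returns whatever A holds, even when site B holds a later value; only a BI retrieval pays the cost of finding it.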

The details for processing BI retrieval requests are
shown in Figures 3-4 and 3-5. The broadcast method is
probably the quickest way to collect data from all the other
nodes, but the use of timers shows that there may be
difficulty if any nodes are so busy that they cannot respond
in time or if a node has actually crashed. The broadcaster
cannot tell the difference in general, although it is possible
to require busy nodes to send some appropriate status
indication that will usually be guaranteed to arrive within
the time limit. The fact that some node did not contribute to
the BI collection process should probably be communicated to
the user who made the BI request. The result returned to him
will only represent the best information currently available
in the system, not necessarily the actual most recent data
value assigned.

Figure 3-2. Delsync Update

    copy update into deferred merge file for later transmission
    "apply" update by placing it into differential file
        associated with main file copy (all access
        to file is directed first through the
        differential file)
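The delsync update of Figure 3-2 can be sketched with a differential file placed in front of the main file copy. The class DelsyncCopy and its dictionary-based storage are hypothetical simplifications; a real differential file would also have to handle deletions and eventual merging into the main file.

```python
from typing import Dict, List, Tuple


class DelsyncCopy:
    """One file copy with a differential file in front of the main file."""

    def __init__(self, main_file: Dict[str, str]) -> None:
        self.main_file = dict(main_file)
        self.differential: Dict[str, str] = {}            # local, unmerged updates
        self.deferred_merge: List[Tuple[str, str]] = []   # saved for transmission

    def update(self, key: str, value: str) -> None:
        # Save the update for later transmission, then "apply" it
        # by placing it in the differential file.
        self.deferred_merge.append((key, value))
        self.differential[key] = value

    def read(self, key: str) -> str:
        # All access is directed first through the differential file.
        if key in self.differential:
            return self.differential[key]
        return self.main_file[key]
```

The main file is never touched by a local update, so the saved updates remain available as a unit for the later merge.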


Figure 3-3. Get Data From Quickest Copy

    DO CASE (data location):
        CASE (local)
            get data
        ENDCASE
        CASE (not local)
            broadcast request for data
            result <-- first return
        ENDCASE
    ENDDO


Figure 3-4. Get Data from All Copies

    set timer
    broadcast request
    DO UNTIL (timer runout or all responses are in)
        save result returned
        check off responding site
    ENDDO
    IF (timer runout)
        THEN alert for possible crash
             note non-responding sites
             for later recovery
    ENDIF
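The collection loop of Figure 3-4 can be imitated without a real network, as in the sketch below. The function collect_from_all is hypothetical, and the dictionary of arrival delays stands in for actual message traffic; the point is only to show how responses are checked off against a time limit and how silent sites are noted.

```python
from typing import Dict, Set, Tuple

Response = Tuple[int, str]  # (timestamp, value); layout is an assumption


def collect_from_all(expected_sites: Set[str],
                     arrivals: Dict[str, Tuple[float, Response]],
                     time_limit: float) -> Tuple[Dict[str, Response], Set[str]]:
    """Keep responses that arrive within the limit; report silent sites.

    `arrivals` maps a site to (arrival delay in seconds, response).
    """
    results: Dict[str, Response] = {}
    for site in expected_sites:
        if site in arrivals:
            delay, response = arrivals[site]
            if delay <= time_limit:
                results[site] = response      # save result, check off site
    missing = expected_sites - set(results)   # possible crash or overload
    return results, missing
```

A site that answers too late and a site that never answers both end up in `missing`, which mirrors the observation that the broadcaster cannot in general tell a busy node from a crashed one.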


Figure 3-5. Select Result

    find response with latest timestamp *
    result <-- response with latest timestamp

    * consider two timestamps, TS1 and TS2:
          TS1 = ( site-id1, clock1 )
          TS2 = ( site-id2, clock2 )
      TS2 is "later than" TS1:
          IF clock2 > clock1
          OR IF clock2 = clock1
             AND site-id2 > site-id1.
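The "later than" rule of Figure 3-5 amounts to comparing (clock, site-id) pairs lexicographically. The sketch below assumes integer clocks and site identifiers; the names Timestamp, later_than, and select_result are hypothetical.

```python
from typing import List, NamedTuple


class Timestamp(NamedTuple):
    site_id: int
    clock: int


def later_than(ts2: Timestamp, ts1: Timestamp) -> bool:
    """True if ts2 is "later than" ts1: clock first, site id as tie-break."""
    return (ts2.clock, ts2.site_id) > (ts1.clock, ts1.site_id)


def select_result(responses: List[tuple]) -> tuple:
    """Pick the (timestamp, value) response with the latest timestamp."""
    return max(responses, key=lambda r: (r[0].clock, r[0].site_id))
```

Using the site identifier as a tie-break gives a total order, so two distinct responses never compare as equally recent.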
3.3.4  Transition  Algorithm 

When  it  is  necessary  for  a  delsync  file  to  be  changed  to 
synch  mode,  an  algorithm  is  needed  to  handle  the  transition. 
The  polling  solution  is  a  simple  master/slave  type  approach: 
the  node  discovering  that  a  transition  is  required  (call  it 
the  initiator)  takes  all  of  the  responsibility  for  that 
transition.  The  initiator  sends  a  transition  command  to  all 
copies  so  that  all  file  copies  can  be  locked  and  have  their 
modes  changed.  Upon  receipt  of  transition  acknowledgements 
(acks)  from  all  copies,  the  initiator  collects  from  each  copy 
all  of  the  updates  that  have  been  saved  since  the  last  merge. 
The  initiator  does  all  consistency  checking  and  conflict 
resolution,  assembles  a  total  update  package  with  the  updates 
in  correct  serial  order,  transmits  it  to  all  copies,  and  waits 
for  the  acknowledgements  of  receipt.  When  the  copies  receive 
the  update  package,  they  apply  the  updates  and  unlock  the 
file;  the  initiator  must  wait  until  all  copies  are  done.  It 
can  then  unlock  its  own  copy  and  proceed  with  the  update, 
which  will  now  be  processed  on  the  (newly  established)  synch 
file. 
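
The polling sequence just described can be sketched as below. This is
a single-process illustration, not the dissertation's implementation:
the `Copy` class, its method names, and the update tuple layout
(clock, site_id, item, value) are assumptions, and no timeouts or
failures are modeled.

```python
class Copy:
    """One file copy; updates are (clock, site_id, item, value) tuples."""

    def __init__(self, name, saved):
        self.name, self.saved = name, list(saved)
        self.mode, self.locked, self.data = "delsync", False, {}

    def transition_command(self):
        self.locked, self.mode = True, "synch"    # lock and change mode
        return True                               # the transition ack

    def collect_saved(self):
        saved, self.saved = self.saved, []        # updates since last merge
        return saved

    def apply_package(self, package):
        for _clock, _site, item, value in package:
            self.data[item] = value
        self.locked = False                       # unlock after applying
        return True                               # ack of receipt


def polling_transition(initiator, others):
    copies = [initiator] + list(others)
    if not all(c.transition_command() for c in copies):
        raise RuntimeError("missing transition ack")
    collected = [u for c in copies for u in c.collect_saved()]
    # Serial order: by clock, ties broken by site id.
    package = sorted(collected, key=lambda u: (u[0], u[1]))
    for c in others:                              # remote copies first
        c.apply_package(package)
    initiator.apply_package(package)              # initiator unlocks last
    return package
```

Conflict resolution is reduced here to "latest timestamp wins"; the
consistency checks the initiator performs are application specific.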


This solution may be symbolized with an evaluation net
diagram, Figure 3-6 [FLLI??]. In this format, circles
represent possible states or "locations" of the procedure
being symbolized, rectangles represent incoming "message
locations" or queues, and horizontal lines stand for
transitions. Transitions are "fired" by having tokens on all
input locations of a transition. Transitions may involve the
sending of new messages, represented by a hollow arrowhead on
a transition line. The dots following the arrowhead indicate
whether a single (one dot) or multiple (three dots imply two
or more) receivers are intended. Transition actions are
explained as comments alongside the lines.


Figure 3-6. Polling Solution for Transition

In Figure 3-6, then, the initial state consists of an
internal token on IDLE. When a file mode change from delsync
to synch is required, a token arrives on INIT TRANS and
transition T1 fires. The transition command is sent to all
other nodes, the file is locked and has its mode changed from
delsync to synch. When all nodes have signified ready with
their acks (a token is placed on TRANS ACK each time the
communications subsystem receives an acknowledgement from
another node), the initiator polls the sites to collect all
updates saved since the last merge of the file copies. When
the update package has been received by all nodes, the file
may be unlocked and any pending updates may proceed on the
newly synchronized file.

There are a number of difficulties with the polling
solution for transition as it is described. A big
disadvantage is that the algorithm takes an unpredictable
amount of time while waiting for acks and updates. It also
cannot distinguish between a node that is slow in answering
and a node that is down. Even if the "down" node has truly
crashed and is not at all operational, it still probably has
unshared updates saved since the last merge. If crossing a
threshold constraint initiated the transition, it could be
dangerous to synchronize and proceed without the crashed
node's outstanding updates. Thresholds would have to be
established so as to minimize the probability of an
inconsistent result.

Another candidate for a transition algorithm, based on a
ring solution, is shown in Figure 3-7. In this case, the
transition initiator creates a transition command and appends
to it the updates (from that node) which have been saved since
the last merge. The combination is passed to the next node,
which appends the updates it has saved, and then on around the
logical ring one node at a time. By the time the command
returns to the initiator, all the updates required for the
merge have been collected and the entire package can be
circulated for application. Notes are used to keep track of
multiple transitions in progress simultaneously.

A big drawback to the circulation scheme is that there is
a possibility for multiple nodes to initiate transitions on
the same file at the same time. For frequently updated files
nearing the transition threshold, this undesirable property
cannot be dismissed as improbable.


Figure 3-7. Ring Solution for Transition

A good way to eliminate both the uncertainties of polling
and the multiple transitions of circulation is the ring
solution of Figure 3-8. This is based on a virtual ring with
control token which was originally developed for synchronized
management of distributed databases [LELA78]. In this
solution, all of the nodes are logically connected in a
virtual ring by assigning a permanent control number to each
node so that the sequence creates a single logical circuit
touching every node in the network. The purpose of the
control token which is circulated around the ring is to
restrict certain activities of the nodes to a one-at-a-time
sequential flow in order to prevent concurrent access
problems. A transition flag in the control token allows only
one file mode transition to be in progress at a time.

In  this  solution,  a  transition  may  be  initiated  only  by  a 
node  possessing  the  control  token  and  when  no  other  transition 
is  in  progress.  The  initiator  sets  a  flag  in  the  token  to 
tell  which  file  should  have  its  mode  changed  to  synchronized, 
and  appends  the  updates  saved  since  the  last  merge  to  the 
control  token. 

As each node after the initiator receives the control
token (Figure 3-9), it appends its own saved updates, changes
the mode of the file in transition, and notes any down
successors in the ring by adding their names to the list in
the control token. The control token and updates are passed
to the successor. Subsequent arrival of the control token at
a non-initiating node implies all available saved updates are
present in the package and can be applied in timestamp order.
The file is then unlocked to local updates and the control
token is passed to the ring successor.


Figure 3-8. Virtual Ring with Token Solution for Transition

Figure 3-9. Virtual Ring: Non-initiating Node View
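
The two circuits of the control token can be sketched as a small
simulation. Everything named here (`Token`, `RingNode`,
`ring_transition`, the `up` flag) is hypothetical; in particular the
boolean `up` merely stands in for "the acknowledgement arrived before
the timeout", and updates are assumed to be (clock, site_id, item,
value) tuples.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    transition_file: Optional[str] = None           # transition flag
    updates: List[tuple] = field(default_factory=list)
    down_nodes: List[str] = field(default_factory=list)


@dataclass
class RingNode:
    name: str
    saved: List[tuple] = field(default_factory=list)
    up: bool = True
    mode: str = "delsync"
    locked: bool = False
    data: dict = field(default_factory=dict)


def ring_transition(ring, start, file_name):
    """Run one transition; ring[start] holds the token and initiates."""
    token = Token(transition_file=file_name)
    order = [ring[(start + k) % len(ring)] for k in range(len(ring))]
    # First circuit: lock, change mode, append saved updates; note down nodes.
    for node in order:
        if not node.up:
            token.down_nodes.append(node.name)
            continue
        node.locked, node.mode = True, "synch"
        token.updates.extend(node.saved)
        node.saved = []
    # Second circuit: apply the complete package in timestamp order, unlock.
    package = sorted(token.updates, key=lambda u: (u[0], u[1]))
    for node in order:
        if node.up:
            for _clock, _site, item, value in package:
                node.data[item] = value
            node.locked = False
    token.transition_file = None      # free the token for another transition
    return package, token.down_nodes  # kept by the initiator for recovery
```

The returned package and list of down nodes correspond to what the
initiator retains for bringing non-participating nodes up to date.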


The virtual ring incorporates timers to maintain
circulation of the control token itself. Control is
relinquished only on positive acknowledgement of receipt of
the control token. If a timeout occurs before the ack
arrives, failure of the successor is suspected. This
mechanism would be used to create the list within the token of
down nodes or of any nodes failing to participate in the
transition of a particular file. The list would be kept by
the initiator, along with the total update collection, for
later use in bringing non-participating nodes up to date.
Similar timers could also be put into the polling solution.

A  disadvantage  to  the  virtual  ring  solution  is  that  a 
node  must  wait  for  the  control  token  to  arrive  before  it  can 
attempt  to  initiate  a  transition.  Furthermore,  if  the  token 
has a transition in progress, the attempt will have to wait at
least  one  more  control  token  circuit  of  the  ring.  In  the 
worst  case,  the  wait  could  involve  N-2  (N  nodes  in  the  ring) 
circuits  if  the  successor  initiated  the  current  transition  and 
each  other  node  also  had  a  transition  waiting. 

We will choose the virtual ring solution for transition,
despite  its  variability  of  execution  time,  because  of  its 
explicit  control  over  the  transition  process  and  the 
status-checking  mechanism  inherent  in  the  circulation  timers. 

One  particularly  interesting  problem  occurs  in  any  of  the 
transition  solutions  if  a  "down"  node  has  not  really  crashed 
but  is  just  disconnected  from  the  network.  That  node  could 
continue  (by  design  or  by  accident)  to  accept  updates  to  its 
local  copy  of  the  file,  which  is  still  in  delsync  mode.  This 
is  a  special  case  of  network  partitioning,  which  we  will  call 
single-node  disconnect.  If  it  is  deemed  serious  enough,  data 
quality  indicators  can  be  exchanged  among  copies  between 
merges  so  that  the  probability  of  error  due  to  exceeding 
thresholds  is  minimized.  For  example,  in  the  airline 
reservation  system,  each  indicator  could  contain  the  number  of 
seats  reserved  on  each  update.  The  copies  could  keep  a 
running  tally  of  the  number  of  seats  reserved.  Then  even  if  a 
node  is  down  at  transition  time,  the  number  of  seats  held  by 
updates  outstanding  before  crash  would  be  known. 

Single-node  disconnect  can  at  least  be  recognized  in  the 
virtual  ring  with  token  if  another  timer  is  reset  to  some 
maximum  allowable  delay  (e.g.,  the  time  required  for  a 
complete  circuit  of  the  virtual  ring)  on  each  token  visit.  A 
time-out  before  token  arrival  should  trigger  the  suspicion 
that  the  node  with  the  time-out  is  disconnected,  and  a 
decision could be made (pre-programmed or interactively with
an  operator)  on  whether  to  discontinue  delsync  updates  as  well 
as  synchronized  and  master/slave.  Notice  that  this  timer  is 
in  addition  to  the  one  required  in  waiting  for  the 
acknowledgement  that  a  passed  control  token  has  been  received. 
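
The additional timer for recognizing single-node disconnect might look
like the sketch below. The class and method names are hypothetical,
and time is passed in explicitly so the logic can be followed without
a real clock.

```python
class TokenWatchdog:
    """Second timer: reset on every token visit, checked between visits."""

    def __init__(self, max_delay, now=0.0):
        # max_delay is the maximum allowable delay, e.g. the time
        # required for one complete circuit of the virtual ring.
        self.max_delay = max_delay
        self.deadline = now + max_delay

    def token_visit(self, now):
        """Reset the timer each time the control token arrives."""
        self.deadline = now + self.max_delay

    def disconnect_suspected(self, now):
        """True if the token is overdue, so this node may be cut off."""
        return now > self.deadline
```

What the node does once `disconnect_suspected` returns true (for
example, refusing further delsync updates) remains the policy decision
described above.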

3.3.5  Merging 

The  algorithm  for  merging  is  similar  to  that  for 
transition  except  that  the  file  mode  is  not  changed.  The 
initiator  sets  a  flag  in  the  token  to  tell  which  file  is  being 
merged,  and  appends  the  updates  saved  since  the  last  merge  to 
the  back  of  the  token.  Each  successive  node  around  the  ring 
appends  its  own  saved  updates  to  the  collection  circulating 
with  the  token.  The  collection  is  complete  when  the  token 
gets  back  to  the  initiator,  and  the  second  circulation  serves 
to  distribute  all  of  the  updates  to  all  of  the  nodes.  Upon 
returning  to  the  initiator  again,  the  merge  is  complete;  the 
collection  can  be  removed  and  the  token  flag  reset  for  some 
other  node  to  use.  If  some  node  is  down  when  the  merge 
occurs,  the  initiator  will  be  responsible  for  saving  the 
entire  update  collection  for  transmission  at  recovery  time. 

3.3.6 Update Conflicts, Error Notification and Recovery

Delsync files have been predicated on a small probability
of conflicting updates. Let us consider what to do in case
conflicts do occur.

The  first  type  of  conflict  is  multiple  updates  to  the 
same  item.  Differential  files  are  appropriate  for  handling 
this  aspect  of  delsync  files.  Local  updates  are  kept  in  the 
difference  file  between  merges.  If  at  merge  time  the  updates 
from  all  other  nodes  are  also  inserted  into  the  difference 
file,  conflicts  can  be  discovered  and  ordered  for  application 
according  to  their  origination  timestamps.  To  complete  the 
merge  operation,  the  entire  difference  file  is  applied  to  the 
main  archival  file  at  the  appropriate  time  (see  the  previous 
section  and  Figure  3-9). 
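
A minimal sketch of this use of the difference file, assuming updates
are (clock, site_id, item, value) tuples and that the main file can be
treated as a dictionary (both are illustrative simplifications, as is
the function name):

```python
from collections import defaultdict


def merge_difference_file(main_file, difference_file):
    """Order all collected updates by timestamp and apply them.

    Returns the merged file and, for inspection, the items that
    received more than one update (the first type of conflict).
    """
    ordered = sorted(difference_file, key=lambda u: (u[0], u[1]))
    by_item = defaultdict(list)
    for update in ordered:
        by_item[update[2]].append(update)
    conflicts = {item: ups for item, ups in by_item.items() if len(ups) > 1}
    merged = dict(main_file)               # leave the archival copy intact
    for _clock, _site, item, value in ordered:
        merged[item] = value               # later timestamps overwrite earlier
    return merged, conflicts
```

Since every node applies the same ordered list, all copies end the
merge with identical contents.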

The second type of conflict is violation of delsync
constraints; it will have to be checked for during each merge.
This is the case where none of the individual files has
triggered the transition from a delsync- to a
synchronized-type file, and yet in aggregate a constraint is
violated (see the airline reservation example in section
3.3.2.2). The first thing the DDBMS must do after discovering
such a conflict is to provide an error notification. The
questions are who should be told, how, and what he needs
to know. The answers depend on what recovery procedures are
appropriate  to  the  applications  using  the  file.  If  the  error 
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action  would  then  be  to  mark  the  file  in  violation  and  allow 
thereafter  only  those  updates  which  would  counteract  the 
violation  (e.g.,  cancelled  reservations)  until  the  error  was 
cancelled  out. 

A  real  recovery  might  be  possible  if  the  constraint 
violation  were  detected  during  collection  of  the  updates  for 
merge  (i.e.,  before  application  of  the  difference  file).  The 
updates  could  be  listed  in  reverse  order  by  timestamp  and  the 
last  few  examined  to  see  if  rejecting  them  (albeit,  belatedly) 
would  prevent  the  conflict.  Again,  the  application  would 
determine  the  appropriateness  of  what  is,  in  effect,  backing 
out  the  last  transactions.  It  would  certainly  not  please  most 
clients  to  have  a  transaction  accepted  one  day  and  rejected 
the  next.  The  cost  of  such  a  possibility  would  have  to  be 
considered  very  carefully  when  first  choosing  a  delsync 
management  approach  and  then  setting  the  delsync  constraints 
and  transition  triggers. 
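
For the airline reservation example, examining the last few updates
for belated rejection could be sketched as follows. The tuple layout
(clock, site_id, seats), the assumption that every update reserves a
positive number of seats, and the function name are all illustrative.

```python
def back_out_latest(updates, capacity):
    """Reject the most recent reservations until the total fits capacity.

    Returns (kept, rejected), with kept in timestamp order and rejected
    listed latest first.
    """
    kept = sorted(updates, key=lambda u: (u[0], u[1]))
    total = sum(u[2] for u in kept)
    rejected = []
    while kept and total > capacity:
        last = kept.pop()                  # latest timestamp goes first
        total -= last[2]
        rejected.append(last)
    return kept, rejected
```

Whether such a rejection is acceptable at all is, as noted above, a
question for the application rather than for the merge procedure.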

Chapter 4

EVALUATION  OF  THE  PROPOSED  APPROACH 

4.1  INTRODUCTION 

It  is  common  to  evaluate  new  management  schemes  by  doing 
some  experiments  to  make  comparisons  with  schemes  which  are 
well  understood  and  documented.  In  the  case  of  distributed 
database  management,  that  is  an  impossible  task  because  there 
are  no  such  things.  The  implementation  of  general-purpose 
distributed  database  management  systems  is  just  beginning  with 
systems such as INGRES and SDD-1 (for details see section A.7
of the appendices). They are not fully implemented and have
not  been  designed  with  test  bed  functions  in  mind  so  that  new 
algorithms  or  schemes  can  be  plugged  into  the  system  in  place 
of  the  original  ones.  It  is  certainly  outside  the  scope  of 
the  research  for  this  dissertation  to  build  an  appropriate 
test  bed,  or  even  to  try  to  coordinate  with  the  developing 
implementations  in  order  to  provide  the  basis  for  direct 
experimentation. 

Fortunately,  we  do  have  an  alternative  approach  that  will 
be  useful.  Rather  than  experimenting  and  measuring  things 
that  do  not  exist,  we  can  construct  models  of  the  competing 
schemes and predict what the cost/performance comparisons will
be. Then, as implementations become available for
measurement, we can begin to validate the modeling approach
and test the predictions. At this early stage of distributed
database experience, such modeling and the analysis which
leads to prediction seem to be quite an appropriate
alternative.


4.2  A  MODELING  APPROACH 


Response time is defined as the interval between a user's
request for a data transaction (retrieval or update) and his
notification  of  its  completion.  In  a  distributed  system, 
"completion"  may  mean  a  variety  of  things.  For  example, 
completion  of  an  update  in  a  master/slave  system  could  mean 
the  update  has  been  applied  at  the  master  copy  and  saved  for 
later  transmission  to  slave  copies.  In  contrast,  a 
synchronized  system  might  consider  an  update  complete  only 
when  all  copies  have  acknowledged  application.  Interactive 
retrieval  completion  is  simpler,  because  the  user  receives  his 
result  at  his  terminal.  To  encompass  some  of  this  variety,  we 
will  consider  a  transaction  to  be  complete  when  the  user 
receives  notice  that  he  can  proceed  with  his  next  transaction. 


We separate contributions to response time (RT) into the
following stages (explained below):

• process request (P1)

• optimization strategy (OS)

• transmission to remote nodes (CD1)

• process data (PD)

• transmission from remote nodes (CD2)

• process result (P2).

For an individual data transaction, the response time will be

$$RT = P1 + OS + f(CD1, PD, CD2) + P2,$$

where the function f depends on the network topology and on
the actual management scheme used.


The  process  request  stage  interprets  the  transaction  and 
maps  it  from  the  user's  view  of  the  limited  database  portions 
he  can  access  to  a  global  (logical)  database  view.  If  the 
request  is  complicated,  it  is  usually  decomposed  into  a 
hierarchy of simple steps. INGRES, for example (see section
A.7.1 of the appendices), decomposes multi-variable queries to
allow  multiple  one-variable  processing.  In  a  centralized 
system,  the  one-variable  steps  would  be  done  sequentially;  in 
a  distributed  system  they  could  be  done  in  parallel  at 
multiple  nodes  if  the  data  were  distributed. 

In  the  optimization  strategy  stage,  the  request  is 
distributed  to  the  nodes  holding  appropriate  pieces  of  the 
data  or  any  special  programs  required.  OS  will  represent 
strictly  the  processing  cost  at  the  originating  node.  This 
can  be  very  complicated  if  selection  of  some  combination  of 
data  movements  and  local  processing  is  desired  (as  in  SDD-1, 

The process result stage collects and coordinates any
remote results being returned, transforms from the global data
view back to the local user's view, and does any processing
required to present results to the user.

The equation for response time is designed to emphasize
the  effects  of  distribution  on  response  time.  It  will  be  used 
to  evaluate  the  proposals  of  this  dissertation  to  trade  off 
timeliness  constraints  and  to  delay  synchronization,  since 
either  of  these  will  contribute  to  the  particular  nature  of 
the  function  f. 

In order to concentrate on comparing the management
schemes, we will take a very narrow viewpoint and reduce our
consideration to just the major contributions of processing
data (PD) and transmission to and from remote nodes (CD1,
CD2):

$$RT = f(CD1, PD, CD2).$$

We consider this to be the first-order statistics of the
problem, and consequently look to first-order queuing analysis
for the solutions. That is, the average contribution from
(P1 + OS + P2) will be so close to constant for the various
management schemes that it will contribute only a constant
offset in the average RT value. Second-order statistics, such
as the variance, are probably not so simply partitioned and
will not be dealt with here. In this chapter, we consider,
then, Poisson arrivals and exponentially distributed service
times with a single server at each node (central or
distributed), the basic M/M/1 queuing situation [KLEI75]. A
summary of the notation used is provided as Table 4-1.
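
For orientation, the standard M/M/1 results on which the later
expressions rest can be stated in the notation of Table 4-1. This is
the textbook form, given here as a reminder rather than as the
derivation of the appendices: with average system arrival rate L and
average system service time XBAR,

$$\mathrm{rho} = L \cdot XBAR, \qquad T = \frac{XBAR}{1 - \mathrm{rho}}, \qquad \mathrm{rho} < 1.$$

For a mix of updates and retrievals in ratio A, the average service
time is the weighted mean XBAR = (A \cdot XU + XR)/(A + 1), which is
where the numerators of the T expressions in section 4.3 come from.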

In  order  to  focus  on  the  distribution  issues  of  the 
various  management  schemes,  we  will  use  a  further 
simplification  of  constant,  fixed  communication  delays  between 
terminals  and  a  central  site  or  between  any  pair  of  nodes  in  a 
network supporting distributed management. Thus, CD1 = CD2 =
CD, a constant. Since this generally represents an
idealized, best-case communications delay, we will actually be
computing lower bounds for the true values of response time.
This is quite appropriate, since our primary interest is in
comparing the management schemes, not in predicting actual
performance values for implemented systems. Choice of some
maximum allowable communications delay that included a
tolerance for resource contention and addition of some maximum
offset value for (P1 + OS + P2) would be needed in order to
calculate approximate upper bounds instead.

Update  and  retrieval  transactions  will  be  assumed  to 
arrive  uniformly  distributed  over  all  the  terminals  in  a 
system,  and  arrival  rates  will  be  expressed  per  terminal. 
Complete  duplicate  copies  of  the  database  will  be  assumed  at 
each  node  except  under  delsync  management,  where  the  copies 
naturally  diverge  between  merges. 


Table 4-1. Summary of Notation, Basic

CD          communications delay
L           average system arrival rate
LR          average arrival rate per terminal of retrieval
            transactions
LU          average arrival rate per terminal of update
            transactions
XBAR        average system service time
XR          average service time for retrievals
XU          average service time for updates
n           number of terminals per node
N           number of nodes
rho         utilization
T           average total time spent in the system
RT          response time
A = LU/LR   update/retrieval ratio per terminal


To  evaluate  the  QR-BI  and  delayed  synchronization 
proposals  of  this  thesis,  we  will  wish  to  extend  the  basic 
models  to  encompass  explicitly  the  data  vs.  access  timeliness 
tradeoff  and  add  a  model  for  the  delayed  synchronization 
management scheme. The approach is the same as in the basic
analysis:

• describe the system flow of transactions according to
each management philosophy,

• use queuing theory (or the operational method) to compute
the average response time as a function of the system
parameters, and

• pick a scenario or standard set of parameters to use as a
basis for comparison.

The queuing model that is appropriate for the QR-BI
environment is the head-of-the-line, non-preemptive, priority
queue [KLEI76]. We will use two priority classes: high for
QR retrievals and low for BI retrievals and for updates.
Within a priority class, order is first-come-first-served on
an arrival basis. This actually introduces a new, extra
dimension of "quickness" to QR, since up to this point we have
been treating it strictly as a function of locally or remotely
stored data.

To  compute  average  response  time,  average  total  times  for 
the  different  transaction  types  in  each  node  are  combined  with 
appropriate  communications  delays  according  to  the  particular 
management  schemes.  The  basic  assumptions  of  constant,  unit 
communications  delay,  constant  system  load  of  100  terminals  in 
the  system,  and  uniform  distribution  of  transaction  arrivals 
are  carried  over  into  the  QR-BI  scenario  from  the  basic  case. 

From Kleinrock's [KLEI76] derivations for
head-of-the-line priority queuing, we specialize to two
priority classes and exponential service distributions to get
average queuing wait times (see Appendix B). New notation is
defined in Table 4-2; for previously used symbols, refer back
to Table 4-1.

Table 4-2. Summary of Notation, Extended

W0          average wait due to transaction in service
LQR         arrival rate per terminal of quick response retrieval
            requests
LBI         arrival rate per terminal of best information
            retrieval requests
B           ratio of QR/BI arrivals
L1          arrival rate of low priority transactions to the
            priority queue
L2          arrival rate of high priority transactions to the
            priority queue
X1BAR       average service time for low priority transactions
X2BAR       average service time for high priority transactions
W1          average wait time (in queue) of low priority
            transactions
W2          average wait time (in queue) of high priority
            transactions
sigma1      = L1 * X1BAR + L2 * X2BAR
sigma2      = L2 * X2BAR
TQR         average total time (in queue and in service) for QR
            retrieval
TBI         average total time (in queue and in service) for BI
            retrieval
TU          average total time (in queue and in service) for
            update
p(local)    probability that a retrieval can be satisfied from
            local data
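
As a hedged illustration of how the symbols of Table 4-2 fit together,
the sketch below evaluates the standard two-class, non-preemptive
head-of-the-line waiting times for exponential service. The formulas
are the textbook ones, not copied from Appendix B (which is not
reproduced here), and the function name is hypothetical. Class 2 is
the high priority class, as in the table.

```python
def priority_waits(l1, x1bar, l2, x2bar):
    """Return (W0, W1, W2) for two non-preemptive priority classes.

    l1, x1bar: arrival rate and mean service time, low priority.
    l2, x2bar: arrival rate and mean service time, high priority.
    """
    sigma1 = l1 * x1bar + l2 * x2bar       # total utilization
    sigma2 = l2 * x2bar                    # high priority utilization
    if sigma1 >= 1:
        raise ValueError("unstable queue: sigma1 >= 1")
    # Exponential service has second moment 2*xbar**2, so the mean
    # residual work W0 = sum(L_i * E[x_i^2]) / 2 reduces to this form.
    w0 = l1 * x1bar ** 2 + l2 * x2bar ** 2
    w2 = w0 / (1 - sigma2)                 # high priority wait
    w1 = w0 / ((1 - sigma1) * (1 - sigma2))  # low priority wait
    return w0, w1, w2
```

When both classes have the same mean service time, the
utilization-weighted average of W1 and W2 equals the ordinary
first-come-first-served wait, which is the conservation property that
keeps the overall average unchanged by reordering.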

4.3 THE MODELS AND THE ANALYSIS

4.3.1  Centralized  Management 

A  centralized  database  system  consists  of  a  central  site 
receiving  update  and  retrieval  transactions  from  all  the 
terminals  in  the  system.  We  assume  that  the  terminals  are 
located  a  fixed  communication  delay,  CD,  from  the  central 
site.  The  flow  diagram  for  the  queuing  model  of  a  centralized 
system  is  shown  in  Figure  4-1.  Its  purpose  is  to  explicitly 
show  the  arrival  rates,  the  communication  paths  and  delays, 
and  the  logical  flow  of  the  transactions  through  the  system 
(e.g.,  transactions  arrive  from  and  are  returned  to  the 
cluster  of  terminals,  which  is  represented  by  the  double 
rectangle  in  the  diagram). 

The mathematical details of the queuing analysis for the
model are in section B.1.1 of the appendices. The average
response time for the system can be expressed by

$$RT = CD1 + T + CD2,$$

where CD1 is the communications delay between the terminals
and the central node, T is the total time spent by a
transaction in the queuing system, and CD2 is the
communications delay from the central node back to the
terminals. The total time in the queuing system is a function
of the service times for updates (XU) and retrievals (XR), the
ratio of update arrivals to retrieval arrivals (A), and the
utilization of the server (rho, which can also be thought of
as what fraction of time the central node is busy processing
the transactions):

$$T = \frac{A \cdot XU + XR}{(A + 1)(1 - \mathrm{rho})}.$$

We are, on the average, restricted to rho < 1 in order that
the system be stable and not bog down. Our performance
evaluator of centralized management is thus

$$RT = 2 \cdot CD + \frac{A \cdot XU + XR}{(A + 1)(1 - \mathrm{rho})},$$

by substituting into the equation for RT what we know about
CD1, CD2, and T. Figure 4-2 shows how this response time
varies with the update/retrieval ratio. The numerical values
have been chosen on the basis of a unit communications delay,
so that if we consider a communications delay of 1 second
(reasonable from section 3.1.3), we are talking about update
service time of 200 ms, and retrieval service time of 100 ms
(reasonable for large database disks commonly used today).
Retrieval arrival rate is chosen as the independent variable
for computational convenience and because the system
performance does depend on the arrivals very basically
(without transaction arrivals the system has no work to do and
performance is meaningless). On this basis, response time
comes out in seconds.
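
The centralized evaluator is easy to tabulate numerically. In the
sketch below the default parameters are the ones quoted above; the
expression used for rho (system arrival rate times average service
time, with all n terminals feeding one server) is an assumption
derived from the definitions in Table 4-1 rather than taken from
section B.1.1, and the function name is illustrative.

```python
def centralized_rt(a, lr, n=100, cd=1.0, xu=0.2, xr=0.1):
    """Average response time for centralized management.

    a:  update/retrieval ratio A
    lr: retrieval arrival rate per terminal (LR)
    """
    # Assumed utilization: L * XBAR with L = n*LR*(A+1) and
    # XBAR = (A*XU + XR)/(A+1), which simplifies to the line below.
    rho = n * lr * (a * xu + xr)
    if rho >= 1:
        raise ValueError("unstable system: rho >= 1")
    t = (a * xu + xr) / ((a + 1) * (1 - rho))
    return 2 * cd + t
```

With A = 1 and LR = 0.02 per terminal, for instance, rho is 0.6 and
the response time is 2.375 in units of the communications delay.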


Figure 4-2. Dependence of Response Time on the Update/Retrieval Ratio
(Centralized, basic case. Vertical axis: normalized response time.
Number of terminals/node = 100; communications delay = 1.0; update
service time = 0.2; retrieval service time = 0.1; update/retrieval
ratio = A; number of nodes = 1.)

Each transaction in the centralized database system
requires  communication  only  between  the  originating  terminal 
and  the  central  node,  so  that  network  messages,  as  between 
copies  of  distributed  database  managers,  are  not  necessary. 
This  would  give  us  a  cost  prediction,  NM  for  the  number  of 
messages,  for  a  centralized  system  of 
NM  =  0  . 

In  order  to  make  any  comparison  with  the  distributed  systems 
still  to  be  analyzed,  let  us  consider  instead  a  centralized 
system  as  implemented  on  top  of  a  communications  subnet  in  a 
manner  directly  analogous  to  the  subnet  which  supports 
communication  among  the  nodes  of  a  distributed  system.  In 
this  way  we  will  be  able  to  compare  the  cost  of  the  systems  by 
counting  network  messages.  Otherwise,  we  would  have  to 
differentiate  among  the  costs  of  long-distance  lines  from 
terminals  to  a  central  node,  the  costs  of  local  lines  between 
terminals  attached  directly  to  individual  nodes  of  a 
distributed  system,  and  the  communications  subnet  required 
among  the  dispersed  nodes  of  the  distributed  system.  So  we 
will  consider  the  long-distance  communications  from  terminals 
to  a  central  node  to  be  of  the  same  cost  as  the  inter-node 
messages  in  a  distributed  system.  The  local  communications 
between  the  terminals  attached  locally  to  a  node  of  a 
distributed  system  would  not  be  counted  similarly,  since  their 
contribution  to  the  operational  cost  of  the  distributed  system 
is  not  comparable.  For  this  analysis,  then,  the  cost 
predictor of a centralized system is
NM  =  2  . 

System  flow  for  centralized  management  of  a  QR-BI 
database  system  is  summarized  in  Figure  4-3.  The  two  types  of 
retrievals  and  priority  queuing  are  shown  explicitly.  The 
total  processing  load  is  the  same  as  in  the  basic  case,  and  it 
seems  likely  that  reordering  the  queue  will  have  little  effect 
on  the  average  system  response  time  (see  the  discussion  of 
conservation  laws  in  [KLEI76]  for  more  detail),  although  it 
may  greatly  influence  the  variance,  a  second  order  statistic 
we  have  not  been  considering.  In  Figure  4-4,  the  B=0  case 
where  there  are  no  QR  requests  at  all  is  exactly  the  basic 
case.  Indeed,  the  introduction  of  QR  and  BI  retrievals  does 
not  change  the  response  time.  Figures  4-5  and  4-6  show  how 
the  contributions  to  the  average  shift  as  the  ratio  of  QR/BI 
changes.  The  dependence  of  the  response  time  on  the 
update/retrieval ratio is shown in Figure 4-7.

Figure 4-5. Average Response Time from Contributions for QR/BI = 1

Figure 4-6. Average Response Time from Contributions for QR/BI = 4

Figure 4-7. Response Time Dependence on the Update/Retrieval Ratio


4.3.2 Master/Slave Management


We will consider two-host resiliency (see section
A.2.2.2.3 of appendices) for the master/slave management of a
distributed  database  system:  all  updates  are  forwarded  to  the 
master  for  application,  then  to  the  back-up,  and  finally  out 
to  all  the  slaves.  The  acknowledgement  to  the  user  of  the 
system's  acceptance  of  the  update  transaction  is  sent  only 
after  the  back-up  has  successfully  applied  the  update  to  its 
(complete  duplicate)  copy  of  the  database.  Each  node  will 
process  all  retrieval  requests  from  its  own  local  terminals 
against  its  own  (complete  duplicate)  copy  of  the  database. 
The  network  flow  diagram  for  the  queuing  model  of  the  system 
is  shown  in  Figure  4-8,  and  the  flow  internal  to  the  nodes  is 
in  Figure  4-9.  By  adding  up  flow  in  and  out  of  each  node,  we 
see  that  the  processing  traffic  required  is  the  same  for  each 
node  in  the  network  (see  section  B.1.2  of  the  appendices  for 
the  details).  This  gives  a  total  time  in  the  queuing  system 
of 

\[
T = \frac{N \cdot A \cdot XU + XR}{(N \cdot A + 1) \cdot (1 - \rho)}
\]

The  average  response  time  throughout  the  network  must 
take  into  account  both  the  different  types  of  nodes  and  the 
different  transactions.  For  the  one  master,  one  back-up,  N-2 
slaves, and an update/retrieval ratio of A,

\[
RT = \frac{1}{N} \cdot \frac{A}{A+1} \cdot 2 \cdot CD
   + \frac{1}{N} \cdot \frac{A}{A+1} \cdot 2 \cdot CD
   + \frac{N-2}{N} \cdot \frac{A}{A+1} \cdot 3 \cdot CD
   + \left[ \frac{2A}{A+1} + \frac{1}{A+1} \right] \cdot T ,
\]

which simplifies to

\[
RT = \frac{A}{A+1} \cdot \frac{3N-2}{N} \cdot CD
   + \frac{2A+1}{A+1} \cdot \frac{N \cdot A \cdot XU + XR}{(N \cdot A + 1) \cdot (1 - \rho)} .
\]

Figure 4-10 shows how the response times for updates and
retrievals combine to form the average. Figure 4-11 shows how
the response time depends on the number of nodes for a given
update/retrieval ratio, and Figure 4-12 how it depends on the
update/retrieval ratio for a given number of nodes.
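
As a quick numerical check on the simplification above, the following sketch evaluates both the term-by-term and the simplified forms of RT for the basic master/slave model. The utilization rho is simply taken as an input here, since its derivation belongs to section B.1.2 of the appendices; the function names and any sample values are illustrative assumptions and not part of the original analysis.

```python
def queue_time(N, A, XU, XR, rho):
    """Total time T in a node's queuing system, as given in the text.

    rho (node utilization) is supplied by the caller; it is not derived here.
    """
    return (N * A * XU + XR) / ((N * A + 1) * (1 - rho))


def rt_termwise(N, A, CD, T):
    """Average response time summed over where the update was entered."""
    u = A / (A + 1)                        # fraction of transactions that are updates
    return ((1 / N) * u * 2 * CD           # entered at the master
            + (1 / N) * u * 2 * CD         # entered at the back-up
            + ((N - 2) / N) * u * 3 * CD   # entered at one of the N-2 slaves
            + (2 * A / (A + 1) + 1 / (A + 1)) * T)


def rt_simplified(N, A, CD, T):
    """The simplified form of the same expression."""
    u = A / (A + 1)
    return u * (3 * N - 2) / N * CD + (2 * A + 1) / (A + 1) * T
```

For any admissible inputs the two functions should agree to rounding error, since 2/N + 2/N + 3(N-2)/N = (3N-2)/N.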

In  the  master/slave  system,  since  retrievals  are  all 
handled  locally,  they  contribute  no  network  messages  to  the 
cost  predictor.  The  contribution  of  the  updates  is  weighted 
by  their  probability  of  occurrence,  so  that  the  system  cost  in 
terms  of  the  average  number  of  messages  required  to  process  a 
transaction  is 

\[
NM = \frac{A}{A+1} \cdot \frac{3N-2}{N}
\]


Figure 4-10. Average Response Time from Contributions
Figure 4-12. Response Time Dependence on the Update/Retrieval Ratio

The system flow in Figure 4-13 for master/slave management
of a QR-BI retrieval system is considerably more complicated
than the basic case in Figure 4-8. Addition of the BI
retrievals requires extra flow paths. We will choose to send
all of the BI requests to the master for processing so that we
can ensure that "best" is with reference to all the activity in
the network. No BI processing will be required in the back-up
or slave nodes, as is clear from the internal flow diagrams,
Figure 4-14. The BIs will increase the processing load on the
master and, as shown in the flow diagram, result in a total
arrival rate for the master that differs from that of the
back-up and slaves (which will still have equal loads).


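The expression for the response time (developed in
section B.2.2 of the appendix) is rather cumbersome:

\[
RT = \frac{A}{A+1} \cdot \left[ \frac{3N-2}{N} \cdot CD + TUm + TUb \right]
   + \frac{1}{(A+1) \cdot (B+1)} \cdot \left[ \frac{N-1}{N} \cdot 2 \cdot CD + TBIm \right]
   + \frac{B}{(A+1) \cdot (B+1)} \cdot \left[ \frac{TQRm}{N} + \frac{TQRb}{N} + TQRs \cdot \frac{N-2}{N} \right]
\]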


where  TUm  and  TUb  are  the  total  time  in  the  master  and  back-up 
nodes  respectively  for  updates,  where  TBIm  is  the  BI  time  in 
the master, and where TQRm, TQRb, and TQRs are the QR times in
the master, back-up and slaves, respectively. This additional
complexity over the basic case (without QR-BI) will prompt us
to look for significant performance or cost benefits in order
to justify the QR-BI approach for master/slave management.
Figure  4-15  shows  the  response  time  dependence  on  the  number 
of  nodes  in  the  system.  It  is  similar  to  Figure  4-11  for  the 
basic  model,  but  we  already  notice  trouble:  the  master 
saturates  and  drives  response  time  up  at  lower  arrival  rates 
for  QR-BI  than  for  basic. 

In  Figure  4-16  we  see  how  response  time  depends  on  the 
ratio  of  QR  and  BI  retrievals.  Notice  here  that  a  QR/BI  of 
zero  means  all  update  and  (BI)  retrieval  requests  must  be  sent 
to  the  master,  much  as  in  a  centralized  system.  Increasing 
values  of  QR/BI  represent  increasing  amounts  of  activity  that 
can  be  handled  locally  by  the  slaves,  reducing  resource 
contention  at  the  master  and  allowing  higher  arrival  rates  to 
be  handled.  Figure  4-17  shows  the  response  time  dependence  on 
the  update/retrieval  ratio. 

The cost prediction for the QR-BI model of master/slave
management simply adds the contribution for the BIs to the
update contribution of the basic case:

\[
NM = \frac{A}{A+1} \cdot \frac{3N-2}{N} + \frac{1}{(A+1) \cdot (B+1)} \cdot \frac{N-1}{N} \cdot 2
\]
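
The two master/slave cost predictions can be compared directly. The sketch below is only an illustration of the formulas in the text, with A the update/retrieval ratio and B the QR/BI ratio; the function names and the sample values are assumptions chosen for the example.

```python
def nm_master_slave_basic(N, A):
    """Average messages per transaction, basic model: only updates cost messages."""
    return A / (A + 1) * (3 * N - 2) / N


def nm_master_slave_qrbi(N, A, B):
    """QR-BI model: add two messages for each BI retrieval not entered at the master."""
    bi_share = 1 / ((A + 1) * (B + 1))     # fraction of all transactions that are BI
    return nm_master_slave_basic(N, A) + bi_share * (N - 1) / N * 2


if __name__ == "__main__":
    N, A = 10, 0.25                        # illustrative values only
    for B in (0, 1, 4, 100):
        print(B, round(nm_master_slave_qrbi(N, A, B), 3))
```

As B grows, the BI share shrinks and the QR-BI cost approaches the basic cost from above; at B=0 every retrieval is a BI retrieval and the message cost is at its largest.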

Figure 4-16. Dependence of Response Time on the QR/BI Ratio
Figure 4-17. Response Time Dependence on the Update/Retrieval Ratio


[Plot legend: master/slave, QR-BI case; 10 terminals per node,
communications delay 1.0, update service time 0.2, retrieval
service time 0.1, QR/BI = 1.00, curves labeled by update/retrieval
ratio A; horizontal axis: retrieval arrival rate per terminal.]


4.3.3  Synchronized  Management 

LeLann  has  suggested  [LELA78]  that  synchronized 
management  can  be  handled  by  explicitly  sequencing  update 
requests  using  tickets  similar  to  ones  found  in  a  bakery  to 
assign  service  order  to  the  customers.  The  ticket  order 
ensures  that  updates  are  processed  in  exactly  the  same  order 
throughout  the  distributed  system.  To  eliminate  concurrency 
in  accessing  (as  well  as  to  avoid  deadlock  over  data 
resources)  and  still  have  distributed  control  over  the 
distributed  database,  the  database  hosts  of  the  system  are 
arranged  in  a  virtual  ring  by  assigning  permanent 
identification  numbers  in  a  sequential  increasing  fashion 
around  the  ring  so  that  a  predecessor  and  successor  are 
defined  for  each  node.  A  special  message  called  the  control 
token  is  circulated  around  the  virtual  ring  to  implement  the 
ticketing facility. Tickets can be taken from the "dispenser"
(the  control  token)  only  by  the  node  owning  the  token  and  are 
assigned  to  pending  update  requests  only  after  the  control 
token  has  been  accepted  by  that  node's  ring  successor.  Thus 
when  an  update  request  arrives  at  a  particular  node,  it  is 
assigned  the  next  ticket  that  the  node  has  available.  If 
there  are  no  tickets  left,  the  request  must  wait  in  a  pending 
queue  until  the  control  token  arrives  (back  at  that  node)  and 
the  node  can  get  more  tickets. 
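
The ticketing behavior described above can be sketched in a few lines. This is a minimal illustration and not LeLann's protocol as published: the class and method names are invented for the example, the `reserve` parameter (taking spare tickets for future arrivals) is an assumption, and the step in which the successor must accept the token before tickets are assigned is deliberately left out.

```python
from collections import deque


class Token:
    """Control token: carries the next unissued ticket number around the ring."""

    def __init__(self):
        self.next_ticket = 0


class Node:
    def __init__(self, name):
        self.name = name
        self.tickets = deque()    # tickets taken from the token but not yet used
        self.pending = deque()    # update requests waiting for a ticket
        self.ticketed = []        # (ticket, update) pairs ready to be sent out

    def submit(self, update):
        """An update request arrives from a local terminal."""
        if self.tickets:
            self.ticketed.append((self.tickets.popleft(), update))
        else:
            self.pending.append(update)   # must wait for the token to come round

    def on_token(self, token, reserve=0):
        """Token arrives: take one ticket per waiting request plus `reserve` spares."""
        for _ in range(len(self.pending) + reserve):
            self.tickets.append(token.next_ticket)
            token.next_ticket += 1
        while self.pending and self.tickets:
            self.ticketed.append((self.tickets.popleft(), self.pending.popleft()))
```

Because only the node holding the token can advance `next_ticket`, ticket numbers are unique across the ring without any central coordinator.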


Ticketed update requests are sent throughout the system.
All nodes are restricted to applying updates in ticket order,
so if preceding ticket numbers are missing, a newly arriving
ticketed update must be queued until all lower ticket numbers
have been processed. The token effectively acts in place of a
centralized master clock to provide global event sequencing.
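
The rule that updates are applied strictly in ticket order amounts to a hold-back queue at each copy of the database. The sketch below shows one way this could look; the `Replica` class and its fields are hypothetical names for the example, and the handling of tickets that are taken but never used (which would have to expire) is not modeled.

```python
import heapq


class Replica:
    """One copy of the database that applies ticketed updates in ticket order."""

    def __init__(self):
        self.next_ticket = 0   # lowest ticket number not yet applied
        self.held = []         # min-heap of (ticket, update) pairs that arrived early
        self.log = []          # updates in the order they were actually applied

    def receive(self, ticket, update):
        """Queue an arriving update, then apply every update that is now in order."""
        heapq.heappush(self.held, (ticket, update))
        while self.held and self.held[0][0] == self.next_ticket:
            _, ready = heapq.heappop(self.held)
            self.log.append(ready)
            self.next_ticket += 1
```

Two replicas that receive the same ticketed updates in different arrival orders end with identical logs, which is the property the ticket scheme is meant to guarantee.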

Synchronized  management  schemes  have  not  been  simple  to 
model [GARC78]. Even the logically simple virtual ring does
not  model  directly  into  a  simple  queuing  system.  An  arriving 
update  transaction  not  only  waits  for  prior  transactions 
queued  for  service,  it  may  also  wait  to  get  a  ticket  assigned, 
it  must  wait  for  prior  ticketed  updates  to  be  applied,  and  it 
may  have  to  wait  to  see  if  prior  ticket  numbers  are  used  or 
have expired. We will avoid some of this complexity by
choosing to work with a very simple logical representation of
the virtual ring, a sequential ring with the successors
located in simple ring order. Even in fully interconnected
networks (where each node has a direct communication link to
each other), we might want to assign successors according to
physical proximity. The sequential ring allows us to continue
using the assumption of constant, fixed communications delays
without consideration of routing and network interconnection
patterns.  We  will  see  that  the  solution  is  efficient  enough 
to  compete  with  the  other  management  schemes  only  under 
certain  conditions.  For  now  our  concern  is  that  it  is  simple 
enough  to  be  modeled. 

Update and retrieval transactions will be considered as
Poisson arrivals uniformly distributed over all network
terminals, as before. If we do not allow tickets to be
requested ahead in anticipation of future arrivals, the update
requests will wait for the token to arrive in order for
tickets to be requested and assigned. The simplest scheme to
ensure update application in the ticket order is to circulate
the updates by appending them to the token and removing them
when they get back to their originating node. This means that
updates arrive at each node all at once to be processed in a
batch.

Retrieval  requests  will  still  be  handled  locally  at  each 
node  (the  complete  duplicate  copies  assumption).  In  order  to 
be  sure  the  retrievals  get  up-to-date  information  they  will 
have  to  wait  for  the  current  circulating  set  of  updates  to  be 
applied.  This  means  retrieval  requests  will  also  be  saved 
until  the  token  arrives  and  then  be  added  to  the  batch  behind 
the current set of updates, Figure 4-18. The flow diagram of
Figure  4-19  shows  how  all  the  nodes  are  connected,  while  the 
internal  details  for  a  single  node  are  in  Figure  4-20.  Notice 
that  even  this  simplified  node  model  is  not  a  standard  queuing 
system,  since  service  may  not  begin  until  the  token  arrives 
and  transactions  arriving  after  batch  processing  begins  may 
not join the queue being processed. This aspect is emphasized
in  Figure  4-20  by  the  barrier  which  is  triggered  for 
transmission  only  by  the  arrival  of  the  control  token  (think 


The  average  response  time  for  a  transaction  will  be  made 
up of three parts:

*  waiting  for  the  token, 

*  waiting  for  the  predecessors  in  the  queue  to  be 
served,  and 

*  actual  service  time. 

Using our assumption of activity being uniformly distributed
across all terminals and nodes of the network, we will say
that the average wait for the token is half of the time for a
virtual circuit. The order in the queue is updates from all
other nodes, local updates and then local retrievals (Figure
4-18), so all local transactions wait for non-local updates
(from the other N-1 nodes) to be processed. Since average
service times are constant, the uniform distribution says the
average wait for local updates will be for half the total
number and the average wait for local retrievals will be for
all updates plus half the number of retrievals. Thus we end
up with (for details, see section B.1.3 of appendices)

\[
RT = 0.5 \cdot N \cdot CD + n \cdot (N-1) \cdot N \cdot CD \cdot A \cdot LR \cdot XU
   + \frac{A}{A+1} \cdot \left( 0.5 \cdot n \cdot N \cdot CD \cdot A \cdot LR \cdot XU \right)
   + \frac{1}{A+1} \cdot \left[ n \cdot N \cdot CD \cdot LR \cdot (A \cdot XU + 0.5 \cdot XR) \right] .
\]

Figure 4-21 shows how this response time varies with the
number of nodes and Figure 4-22 shows the dependence on the
update/retrieval ratio.
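
To make the structure of the synchronized response time easier to follow, the sketch below builds it up from the quantities accumulated during one circuit of the token. It assumes that n is the number of terminals per node and LR the retrieval arrival rate per terminal (so that A·LR is the update rate per terminal), which is how the figure legends present the parameters; the function name and sample values are illustrative only.

```python
def sync_response_time(N, CD, n, A, LR, XU, XR):
    """Average response time for the basic synchronized (token ring) model."""
    circuit = N * CD                            # one trip of the token round the ring
    updates_per_node = n * A * LR * circuit     # updates a node collects per circuit
    retrievals_per_node = n * LR * circuit      # retrievals a node collects per circuit

    token_wait = 0.5 * circuit                  # on average, half a circuit
    remote_updates = (N - 1) * updates_per_node * XU   # batch from the other nodes
    update_wait = 0.5 * updates_per_node * XU          # half of the local updates
    retrieval_wait = (updates_per_node * XU            # all local updates, then
                      + 0.5 * retrievals_per_node * XR)  # half of the local retrievals

    return (token_wait + remote_updates
            + A / (A + 1) * update_wait
            + 1 / (A + 1) * retrieval_wait)


if __name__ == "__main__":
    # Illustrative values, not taken from the figures.
    print(sync_response_time(N=10, CD=1.0, n=10, A=0.25, LR=0.01, XU=0.2, XR=0.1))
```

The first term alone grows linearly with N·CD, which is why a communications delay that is large relative to the service times dominates the result.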

If we treat the control token as having the updates to be
circulated appended to it, the combined token/update package
can be considered as a message. For each update, then, N
messages are required to complete the circulation, and the
number of messages averaged over both updates and retrievals
gives us

\[
NM = \frac{A}{A+1} \cdot N .
\]
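
For comparison with the master/slave cost given in section 4.3.2, the two basic-model message counts can be tabulated against the number of nodes. This is a small sketch of the two formulas from the text; the function names and the choice of A are assumptions made for the example.

```python
def nm_synchronized(N, A):
    """Synchronized: each update travels N hops round the virtual ring."""
    return A / (A + 1) * N


def nm_master_slave(N, A):
    """Master/slave (basic model): (3N-2)/N messages per update on average."""
    return A / (A + 1) * (3 * N - 2) / N


def compare(A, sizes):
    """Return (N, master/slave NM, synchronized NM) for each network size."""
    return [(N, nm_master_slave(N, A), nm_synchronized(N, A)) for N in sizes]


if __name__ == "__main__":
    for row in compare(0.25, (2, 5, 10, 20)):
        print(row)
```

The two expressions coincide at N=2; beyond that the synchronized count keeps growing linearly with N while the master/slave count stays below 3 messages per update.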

Figure 4-21. Dependence of Response Time on the Number of Nodes in the System
Figure 4-22. Dependence of Response Time on the Update/Retrieval Ratio

The network flow for synchronized management in a QR-BI
environment is just the same as for the basic model, Figure
4-19. What is different is the queuing discipline within each
node, Figure 4-23. QR retrievals will get head-of-the-line
priority treatment if they arrive while a batch is in
progress. Otherwise they will simply queue by themselves for
immediate processing service. BI retrieval requests will be
saved up until the token arrives and then added to the end of
the batch behind the updates, so that their results will
provide information that is "best" within the context of the
currently circulating update batch. Any BI requests arriving
after the token (even if it's during the batch processing)
must be saved up until the next token arrival to ensure "best"
with respect to the concurrent updating activity in the
system. As in the basic model, we assume arrivals are
uniformly distributed over the system's terminals and that
each node maintains a complete duplicate copy of the database.

Figure 4-24 shows the response time dependence on the
number of nodes in the virtual ring, Figure 4-25 the
dependence on the update/retrieval ratio, and Figure 4-26 the
dependence on the ratio of QR/BI.

As in the basic case for synchronized management, it is
only the updates which contribute network message traffic to
the cost prediction,

\[
NM = \frac{A}{A+1} \cdot N .
\]


Figure 4-23. Internal Flow Diagram for Synchronized QR-BI Node
Figure 4-26. Response Time Dependence on the QR/BI Ratio

4.3.4 Delayed Synchronization Management

There  are  really  three  phases  for  us  to  be  concerned  with 
in  delsync  management: 

•  ordinary daily operation: where updates are all
   processed locally, most QRs are processed locally
   (because the copies diverge doing local updates, some QRs
   may need data not stored in the local copy), and BI
   results must be collected from all other nodes so that
   the most recent result can be selected;

•  merging: the periodic technique by which all the
   divergent copies are synchronized to the most recent
   information across the system; and

•  transition: which is actually the end of a file's
   delsync classification as it changes to synchronized.
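
The daily-operation and merging phases both rest on the timeliness information kept with each data item. The sketch below illustrates the idea under a simplifying assumption that each copy is a mapping from item key to a (value, timestamp) pair with comparable timestamps; the function names and this representation are hypothetical and are not the control flow algorithms of Chapter 3.

```python
def best_information(copies, key):
    """BI retrieval: the most recently updated value for `key` over all copies."""
    candidates = [copy[key] for copy in copies if key in copy]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])   # item = (value, timestamp)


def merge(copies):
    """Periodic merge: every copy ends up with the newest version of each item."""
    merged = {}
    for copy in copies:
        for key, (value, stamp) in copy.items():
            if key not in merged or stamp > merged[key][1]:
                merged[key] = (value, stamp)
    return [dict(merged) for _ in copies]   # one identical copy per node
```

Right after `merge` every node can answer any QR locally, which corresponds to the probability of finding data locally being 1 immediately after a merge.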

The phase which is important to the cost/performance
predictions  we  have  been  making  in  this  chapter  is  the  daily 
operation  phase;  in  fact,  we  will  assume  that  merges  are  done 
during  some  period  of  very  low  activity  such  as  the  middle  of 
the  night.  For  applications  such  as  world-wide  airline 
reservations,  there  may  not  be  any  such  period,  and  we  really 
should  account  for  some  extra  processing  contention  and  delay 
when  merges  take  place. 

The queuing model for each delsync node is shown in
Figure 4-27. We see that the arrival rate of QR requests for
which the required data are local is IQR = LQR · p(local) and
these get high priority treatment in the local queue.
Actually, the probability of finding data locally is a
function of time: right after a merge, p(local) = 1; just
before a merge, p(local) has its minimum value, less than 1.
For our first-order statistics, we will use only an average
value, p(local), and apply it on the average over the entire
time period between merges. Thus, the rate at which QR
requests must be passed to another node is
FQR = LQR · [1 - p(local)]. We assume these external requests
will be uniformly distributed over the remaining nodes. This
is shown by the solid lines of Figure 4-28 along with the
communications delays incurred. Similarly the local node's
share of QR requests processed for the other nodes is returned
directly to those nodes via the solid lines. This is slightly
different from the control flow algorithm of section ,1.1.1; it
is equivalent under the uniform fixed communication delay that
we have assumed between any two nodes in the network. That
is, we are now adding an assumption that the identity of the
copy responding "quickest" will be uniformly distributed over
the remaining nodes.

The dashed lines of Figure 4-28 represent virtual ring
connections which are used to handle BI retrievals. Local BI
requests are put out onto the virtual ring, as are BI results
going back to the other nodes (Figure 4-27). Results from the
ring of previous local requests sent out are processed locally
to choose the best result and are then returned to the user.
Again this differs from the control flow algorithm of section
3.3.3 for modeling tractability.

We can express the average delsync response time
(developed in detail in section B.2.4 of the appendices) as

\[
RT = TU \cdot \frac{A}{A+1}
   + \frac{N \cdot (CD + TBI)}{(A+1) \cdot (B+1)}
   + \left( TQR + [1 - p(\mathrm{local})] \cdot 2 \cdot CD \right) \cdot \frac{B}{(A+1) \cdot (B+1)} .
\]
Figure  4-29  shows  how  the  response  time  varies  with  the 
probability  that  QR  retrievals  can  be  handled  locally.  Figure 
4-30  shows  the  tradeoff  between  more  nodes  handling  more 
arrivals  with  higher  average  response  times  because  of  the 
longer  time  to  complete  a  circuit  of  the  virtual  ring. 
Dependence  of  response  time  on  the  proportion  of  QR  to  BI 
retrieval  requests  is  dramatically  demonstrated  in  Figure 
4-31:  the  more  activity  that  can  be  handled  locally,  the 
better  the  response.  Figure  4-32  shows  the  dependence  on  the 
update/retrieval  ratio. 
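
The delsync response time expression is a weighted sum over the three transaction types, and its sensitivity to the QR/BI ratio can be seen by evaluating it for a few values of B. The sketch below takes the node times TU, TBI and TQR as given inputs rather than deriving them from the queuing model of section B.2.4; the function name and all numerical values are illustrative assumptions.

```python
def delsync_response_time(N, CD, A, B, p_local, TU, TBI, TQR):
    """Average delsync response time as a mix of updates, BI and QR retrievals."""
    update_share = A / (A + 1)                 # fraction of transactions that are updates
    bi_share = 1 / ((A + 1) * (B + 1))         # fraction that are BI retrievals
    qr_share = B / ((A + 1) * (B + 1))         # fraction that are QR retrievals
    return (TU * update_share                              # updates are purely local
            + N * (CD + TBI) * bi_share                    # BI makes a full ring circuit
            + (TQR + (1 - p_local) * 2 * CD) * qr_share)   # non-local QR: one round trip


if __name__ == "__main__":
    for B in (0, 1, 4):
        rt = delsync_response_time(N=10, CD=1.0, A=0.25, B=B,
                                   p_local=0.9, TU=0.4, TBI=0.2, TQR=0.2)
        print(B, round(rt, 3))
```

Because the BI term is multiplied by N while the QR term is not, shifting retrievals from BI to QR lowers the average sharply, which is the behavior the text attributes to Figure 4-31.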

Network  messages  are  required  to  handle  both  the  best 
information  retrievals  and  the  small  portion  of  quick  response 
retrievals  which  cannot  be  answered  from  locally  stored 
information.  The  average  cost  is  thus 

Figure 4-29. Dependence of Response Time on Locality
Figure 4-31. Dependence of Response Time on the QR/BI Ratio
Figure 4-32. Response Time Dependence on the Update/Retrieval Ratio

Chapter 5

RESULTS OF THE ANALYSIS

5.1 COMPARISON OF THE BASIC MODELS

In  order  to  compare  the  performance  of  the  centralized, 
master/slave  and  synchronized  management  schemes  using  the 
basic  models,  a  standard  set  of  parameter  values  is  used. 
These  are  listed  at  the  top  of  Figure  5-1  and  could  be  taken 
to  represent  units  of  seconds.  The  system  load  is  kept 
comparable  by  considering  100  terminals  in  the  centralized 
system  and  10  nodes  each  with  10  terminals  (for  a  system  total 
of  100  terminals)  in  the  distributed  cases.  The  response  time 
is  normalized  to  units  of  communication  delay  to  give  lower 
bound  results  and  the  actual  magnitude  of  the  response  times 
is  less  important  than  the  relative  positions  of  the  curves. 
Figure  5-1  shows  the  master/slave  scheme  has  better  average 
response  time  than  the  centralized  scheme  since  only  local 
retrievals  are  competing  with  system  updates  for  processing 
resources  at  each  node  rather  than  system  retrievals  competing 
with  system  updates  at  a  single  node.  Saturation  of  the 
central  site  also  occurs  for  much  lower  arrival  rates  than 
saturation of the master node because of the higher
centralized degree of contention.

Figure 5-1. Comparison of the Basic Models
[Plot legend: basic case; 10 terminals per node, communications
delay 1.0, update service time 0.2, retrieval service time 0.1,
update/retrieval ratio 0.2E, 10 nodes.]

Conclusion: The system flow diagrams are
particularly useful to explicitly represent the
differences among competing management philosophies so
that performance predictions can be made.

The sequential nature of the synchronized model stands
out as a clear disadvantage in response time until the central
site and master node are saturated and operating at very high
utilizations. Our choice of a communications delay between
nodes large with respect to processing delays within nodes
reflects our concern with long-haul networks, i.e.,
distributed systems of geographically dispersed nodes.
Setting aside this characteristic momentarily and allowing
communications delays comparable to processing times, we see
from Figure 5-2 that synchronized management competes very
well with the other schemes in a situation which is
representative of the parameter values for a local network.
For this reason we will continue to model synchronized
management and compare it to the other approaches.

Conclusion: The virtual ring with token and tickets
scheme for synchronized management does not compare well
with centralized or master/slave management for networks
with communications delays large compared to transaction
service times (i.e., for long-distance networks). It
does compare well for the smaller communications delays
typical of local networks.


The  cost  comparison  among  schemes  is  represented  in 
Figure  5-3.  The  number  of  messages  required  to  handle  a 
transaction  is  averaged  over  all  of  the  transaction  types. 
The  particular  points  corresponding  to  the  performance 
comparison  parameters  are  for  10  nodes  and  are  indicated  by 
the  arrows  in  the  figure.  Again,  the  master/slave  scheme 
gives  the  best  result  because  of  less  contention  than 
centralized  and  less  sequentiality  than  synchronized. 

Conclusion:  The  first-order  statistics,  average 
response  time  for  performance,  and  average  number  of 
messages  to  handle  a  transaction  for  cost,  do  indeed 
differentiate  among  the  management  schemes  in  a  formal 


Figure 5-3. Message Flow Comparison for Basic Models

5.2 COMPARISON BETWEEN BASIC AND QR-BI MODELS

A  direct  comparison  for  our  standard  parameter  set  of  the 
basic  and  QR-BI  models  of  master/slave  management  is  shown  in 
Figure 5-4. It clearly shows that introduction of BI
retrievals, which must be serviced by the master instead of
locally by the slaves, introduces additional resource
contention in the master. This degrades the average system
response time, and increasingly so as the QR/BI proportion
decreases. The average number of messages required to process
a transaction, Figure 5-5, reflects the same effect. The
point  is  that  the  system  performance  is  completely  driven  by 
the  performance  of  the  master  node,  because  it  has  the  most 
work  to  do. 

In contrast, the average response time for synchronized
management, Figure 5-6, shows significant improvement with
increasing amounts of QR processing that can be done between
batches of BIs and updates. Since only updates require
messages for circulation, the average number of messages per
transaction is identical for both the basic and QR-BI models.

Delayed  synchronization  has  no  basic  model  since  it  was 
designed  specifically  for  the  QR-BI  case.  We  can  refer  back 
to  Figure  4-31  to  see  how  it  improves  with  an  increasing 
proportion of QR.

Conclusion: The data timeliness - access timeliness
tradeoff, as manifested in the QR and BI retrieval
options under the specific assumptions made about the
operational environment:

*  has no effect on centralized management (from Figure
   4-4),

*  degrades the performance of master/slave management,

*  significantly improves the performance of
   synchronized management, and

*  supports the alternative management scheme of
   delayed synchronization.

5.3 COMPARISON AMONG THE QR-BI MODELS

The standard set of parameter values used to compare the
performance of the management schemes in a QR-BI environment
is listed at the top of the figures in this section. In
order to demonstrate the particular effect of the QR-BI
proportion, there are comparison plots for three values of
QR/BI. With QR/BI=0 in Figure 5-7, we see that master/slave
reduces  almost  to  centralized  since  all  transactions  are  sent 
to  the  master  for  processing.  Delayed  synchronization 
performs  poorly  by  comparison  with  all  the  others,  since  it 
was  designed  specifically  for  a  high  QR/BI  environment,  and  it 
does  not  look  like  the  synchronized  case,  because  the  (BI) 
retrieval  requests  are  circulating,  not  the  updates. 

In Figure 5-8 with QR/BI=1, retrievals are equally
divided between QR and BI types. Master/slave improves some
over centralized because handling the QRs locally reduces
contention at the master. Synchronized improves some because
QRs may be processed as they arrive between batches. Delayed
synchronization improves most since now only half the
retrievals (BIs) are circulated, while nearly half (i.e., the
local QRs) are handled locally. In fact, the delsync approach
now competes quite well with the others.

The improvements for synch, delsync, and master/slave
continue relative to the centralized case, Figure 5-9, as
QR/BI  continues  to  increase  and  the  greater  proportion  of 
local  activity  reduces  the  effects  of  the  communications 
delays  on  the  average  response  time. 

Conclusion: Delayed synchronization is an
appropriate  management  alternative  offering  significant 
performance  improvement  in  an  environment  which  has  a 
high  transaction  arrival  rate  with  a  large  proportion  of 
quick  response  requests. 

As  in  the  basic  case,  the  cost  comparison  among  the 
management  schemes  is  represented  by  the  network  message 
traffic  requirements,  Figures  5-10  and  5-11.  Again,  the 
average  is  computed  over  all  types  of  transactions  and  the 
particular  cases  for  the  standard  parameter  set  (10  nodes)  are 
indicated by arrows. Figure 5-10 compares the schemes for a
QR/BI ratio of 1, that is, retrievals evenly divided between
QR and BI types. A higher degree of local activity is
represented by QR/BI=4 in Figure 5-11.

Conclusion: Even the simple and somewhat
inefficient implementation model for delayed
synchronization management predicts a cost that is
competitive with the other management schemes in the
many-transaction, high-QR environment for which delsync
performs better.

Chapter 6

CONCLUSIONS AND IMPLICATIONS

6.1 SUMMARY OF CONCLUSIONS

The  evaluation  methodology  that  has  been  developed  is  a 
valuable  tool  for  comparing  different  management  approaches  to 
distributed  database  systems.  The  flow  diagram  technique 
emphasizes  the  importance  of  analyzing  the  system  to  identify 
both  the  specific  control  paths  and  the  data  flow  that  are 
required  to  process  each  type  of  database  transaction.  It  is 
from such information that specific cost/performance results
can be derived. Because few distributed database systems have
been implemented and those mostly serve the special-purpose
needs of particular applications, the cost/performance
predictions of Chapter 5 remain to be verified:

*  The virtual ring with token and tickets scheme for
   synchronized management is more appropriate for local
   networks than for long-distance, geographically dispersed
   networks.

*  In the assumed operational environment, the quick
   response and best information retrieval options
   significantly improve the cost/performance of
   synchronized management while providing no benefit to
   centralized management and actually degrading the
   cost/performance of master/slave management.

*  Delayed  synchronization  management  offers  significant 
performance  improvement  at  competitive  costs  for 
appropriate  applications  in  an  environment  where  the 
transaction  arrival  rate  is  high  and  the  proportion  of 
retrievals  that  are  quick  response  requests  is  large. 

Despite  the  many  assumptions  required  in  the  modeling  and 
the  simple  queuing  approach  of  Poisson  arrivals  and 
exponential  service  distributions,  the  wide  range  of  variation 
in  the  predictions  provides  very  useful  information  to 
distinguish  among  the  management  schemes.  Although  the 
specific  results  are  not  at  all  surprising,  in  that  they  do 
not  contradict  our  intuition  about  the  management  approaches, 
they  do  provide  more  formalism  to  the  comparisons  and  they 
will  allow  even  more  detailed  comparison  when  experimental 
data  become  available  to  verify  the  predictions. 

6.2  SUGGESTIONS  FOR  FURTHER  WORK 

6.2.1  Modeling  and  Analysis 

An  extension  to  the  analytical  work  which  might  be 
suggested  is  to  give  up  the  simple  representation  of 
communications  delay  as  a  constant  and  to  choose  a  queuing 
model. This would allow consideration of any extra delays in
response  time  because  of  contention  over  the  communications 
resources.  However,  increasing  experience  with  broadcast 
contention  networks  like  Ethernet  and  consideration  of  sparse 
loading  conditions  for  packet  networks  tend  to  support  the 
simpler  assumption  of  constant  delays. 

A  refinement  of  the  models  which  would  be  useful  is  to 
include  more  detail  about  what  activity  occurs  within  each 
node  (see  [BUCC79],  for  example).  Contributions  from  time 
spent  in  the  central  processor  could  be  distinguished  from 
input/output  contributions,  and  the  effects  of 
multiprogramming  and  multiple  storage  units  could  be 
considered.  It  might  also  be  useful  to  give  up  the  assumption 
that  all  nodes  are  identical.  Particularly  in  the  case  of 
master/slave  management,  the  system  performance  would  be 
improved  by  providing  the  master  with  faster  or  more  powerful 
equipment  while  the  percentage  increase  in  the  total  system 
purchase  price  might  not  be  very  significant.  A  sample 
question  which  could  be  answered  this  way  is  how  much  faster 
would  the  master  node  have  to  be  in  order  not  to  saturate 
before  the  individual  slave  nodes  do. 

A  useful  extension  of  the  modeling  would  be  to  give  up 
the  assumption  that  transaction  arrivals  are  uniformly 
distributed  over  all  terminals  of  all  nodes.  This  would  allow 
analysis  of  clustered  transaction  arrivals  against  complete
duplicate  copies  of  the  database  and  against  incomplete  copies
where  the  degree  of  locality  of  reference  in  the  arrivals 
becomes  important. 

Adding  results  from  such  further  work  to  the  basic 
results  already  developed  would  probably  give  enough 
information  to  develop  a  table  of  performance  predictions 
based  on  both  the  application  characteristics  and  the 
management  scheme  as  implemented  on  some  specific  network.  As 
data  become  available  to  substantiate  some  of  the  predictions, 
the  table  might  begin  to  offer  guidelines  for  system  designers 
to  use  in  matching  applications  with  management  approaches.
With  such  a  goal  in  mind,  it  might  be  more  appropriate  to  use 
the  operational  analysis  approach  to  queuing  network  solutions 
initiated  by  Buzen  (see  particularly  [DENN78])  since  it  is 
based  on  operational  experience,  rather  than  relying  on  the 
stochastic  modeling  that  is  often  difficult  to  relate  closely 
to  real  systems.  Data  gathered  from  experience  with  existing 
networks  should  begin  to  provide  the  parameter  values  which 
will  be  necessary  to  carry  out  the  computations. 


6.2.2  Data  Activity  Index 


An  interesting  extension  to  the  work  of  this  dissertation 
(which  is  much  smaller  in  scope  than  the  above  suggestions  for 
modeling  and  analysis)  is  to  pursue  in  more  detail  the 
mechanisms  to  support  the  quick  response  and  best  information
retrieval  options  as  an  explicit  implementation  of  the
timeliness  tradeoff.  The  data  quality  indicator  (DQI,  see
section  3.2.3)  has  potential  for  general  use  in  multiple-copy 
distributed  systems  where  there  is  any  likelihood  for 
retrieval  accesses  to  be  delayed  due  to  higher  priority 
traffic  or  operations.  Availability  of  local  data  with 
quality  information  may  well  serve  the  immediate  or  temporary 
requirements  of  a  variety  of  users. 

Data  quality  indicators  (section  3.2.3)  could  represent  a
crude  indication  of  what  the  history  of  update  activity  on  a 
particular  data  item  had  been,  if  there  were  some  initial  time 
reference  point  and  a  well-established  or  known  function  of 
time  to  describe  the  activity.  We  could  consider  adding  to 
the  DQI  a  timestamp  for  database  insertion  of  the  item,  but 
the  overhead  of  the  DQI  is  already  so  high  that  its 
practicability  is  dubious.  Instead,  let  us  step  up  a  level  of 
granularity  again,  and  consider  files  (as  a  grouping  of  items) 
rather  than  individual  data  items.  Just  as  different  DQIs 
apply  to  different  types  of  data,  so  do  different  update 
patterns  go  with  different  types  of  data.  If  we  could 
characterize  those  patterns,  we  could  associate  with  each  file 
a  data  activity  index  which  would  give  more  information  to 
users  about  the  data. 

For  example,  text  files  begin  with  insertions,  tend  to 
have  a  high  update  rate  for  a  while  and  then  taper  off  as
revisions  are  made  and  a  final  version  is  approached.  This
might  be  approximated  by  an  exponential  distribution  for  the
update  frequency  versus  time,  although  not  everyone  handles
text  files  this  way.  So  the  areas  to  be  considered  within
this  topic  would  include:

•  characterization  of  data  types  and  update  frequencies, 

•  appropriate  parameters  for  characterizing  the  descriptive 
distributions,  and 

•  what  the  collection  of  activity  histories  would  entail. 
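As a purely illustrative sketch of the idea, a per-file activity index could be kept as a decaying count of updates, so that a burst of early revisions fades as the file settles toward a final version. The class name, the half-life parameter, and the decay rule are assumptions made for this example; the discussion above does not prescribe any particular form for the index.

```python
from dataclasses import dataclass


@dataclass
class ActivityIndex:
    """Hypothetical per-file index: a count of updates that decays over time."""
    half_life: float        # assumed tuning parameter, in the same units as `now`
    value: float = 0.0      # decayed update count as of `last_time`
    last_time: float = 0.0  # time of the most recent recorded update

    def read(self, now: float) -> float:
        """Return the index as seen at time `now` without changing it."""
        elapsed = now - self.last_time
        return self.value * 0.5 ** (elapsed / self.half_life)

    def record_update(self, now: float) -> None:
        """Decay the stored count up to `now`, then count one more update."""
        self.value = self.read(now) + 1.0
        self.last_time = now


index = ActivityIndex(half_life=1.0)
index.record_update(0.0)
print(index.read(1.0))   # one half-life after a single update
```

A user could compare the index of a file against a threshold to judge whether a local copy is likely to be stale.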

It  is  interesting  to  notice  that  the  technologies  of  demand
paging  and  automatic  memory  management  are  already  involved
with  keeping  track  of  data  activity  at  a  low  level  of 
sophistication.  For  example,  there  are  counters  or  use  bits 
referred  to  by  procedures  such  as  least  recently  used  page 
removal  algorithms.  The  data  activity  index  is  a  sort  of 
natural  extension  that  is  appropriate  to  a  whole  network 
context  rather  than  to  the  memory  hierarchy  of  a  single 
machine.

6.3  SOME  IMPLICATIONS

The  specific  results  which  have  been  presented  to  compare 
the  cost/performance  of  the  various  management  schemes  are
important  primarily  as  examples  of  what  can  be  done  using  the 
evaluation  methodology.  There  are  other  proposals  for  both 
master/slave  and  synchronized  management  schemes  that  can  be
evaluated;  we  chose  the  specific  examples  of  n-host  resiliency
for  a  master/slave  scheme  and  the  virtual  ring  with  token  and 
tickets  for  a  synchronized  scheme.  Furthermore,  as  new 
management  proposals  are  made,  they  too  can  be  evaluated,  just 
as  the  delayed  synchronization  scheme  was. 

The  evaluation  methodology  thus  gives  us  an  opportunity 
to  lay  out  an  entire  spectrum  of  management  strategies  and 
investigate  their  appropriateness  for  particular  application 
and  network  operating  characteristics.  The  emphasis  has  been 
on  the  average  system  cost/performance;  more  detail  in  the
queuing  network  model  may  lead  to  consideration  of  other
factors.

In  addition  to  this  general  technological  impact,  it  will 
be  interesting  to  see  what  use  can  be  made  of  the  specific 
proposal  for  a  timeliness  tradeoff.  It  may  not  be  too 
difficult  to  augment  some  existing  database  management  system 
to  handle  quick  response  and  best  information  retrievals. 
This  would  offer  experience  that  might  be  valuable  in  building 
synchronized  distributed  systems,  where  we  have  seen  that  the 
explicit  tradeoff  in  the  two  retrieval  types  has  the  potential 
for  really  improving  the  average  system  cost/performance.

Appendix  A


DISTRIBUTED  DATABASE  RESEARCH 


A.1  INTRODUCTION

Managing  a  distributed  database  begins  with  all  of  the 
problems  involved  in  managing  a  centralized  database  and  is 
complicated  by  having  the  management  responsibility  and  the 
database  itself  divided  among  multiple  computer  systems  which 
may  be  geographically  as  well  as  physically  separate.  The 
first  purpose  of  this  appendix  is  to  outline  those  aspects  of 
the  problems  which  are  specifically  associated  with  the
distributed  nature  of  the  data  and  its  management.  The  second 
purpose  is  to  briefly  survey  the  work  that  has  been  done  to 
date  toward  solving  some  of  these  problems  so  that  real  DDBMSs 
could  be  built. 

The  discussion  is  divided  into  problem  areas  which  can  be 
briefly  characterized  as: 

integrity:  correctness  of  individual  data  items  in  the
context  of  the  whole  database  as  a  model  representing
some  enterprise,  where  a  data  item  is  the  smallest
independently  accessible  unit  of  data;

organization:  location  and  arrangement  of  data  and
directories  to  facilitate  efficient  responses  to  user
requests;

security:  protection  of  data  from  accidental  or  malicious
corruption  and  prevention  of  unauthorized  access;

data  incompatibility:  limitations  on  data  usage  because
of  its  structure  or  representation;

reliability:  availability,  failure  recovery,  explicit
indications  of  database  status  with  regard  to  individual
operations,  and  prevention  or  reduction  of  situations
where  data  is  inaccessible;  and

implementation:  experience  in  building  systems  for  actual
use.

A  difficulty  often  encountered  in  discussions  of 
distributed  databases  is  the  lack  of  an  agreed-upon 
vocabulary.  For  the  purposes  of  this  discussion,  a  database 
transaction  is  the  general  term  for  any  user  interaction  with 
the  database  (DB).  Transactions  requiring  only  reads  from  the 
DB  are  called  retrievals;  transactions  requiring  writes  are
called  updates  (including  insertions  and  deletions).  In
general,  an  update  will  require  reading  something  before  the
writing  is  done.  Transactions  will  involve  some  specific  (set
of)  data  item(s),  the  smallest  accessible  unit(s)  of  the
database.  Each  data  item  consists  of  a  value  and  a  timestamp
(node  identification  and  clock  time)  for  insertion  or  last
update  of  the  value.  The  granularity,  or  what  size  a  data  item
actually  is,  will  not  be  discussed  except  in  terms  of  storage
costs.


With  this  background,  let  us  proceed  to  examine  some
problems.  The  discussions  are  intended  to  include:
definition  of  the  problem  from  a  centralized  point  of  view,
additional  considerations  or  complications  in  the  distributed
environment,  a  brief  presentation  of  some  proposed  solutions,
and  mention  of  known  restrictions  or  limitations  of  the
solutions.  In  general,  the  choice  of  solutions  discussed  is
based  on  details  which  were  referred  to  in  the  body  of  the
dissertation.

A.2  INTEGRITY

In  the  general  context  of  database  management,  the
problem  of  integrity  includes  both:  (1)  protection  of  data
items  from  loss  or  destruction  by  errors  and  (2)  assurance
that  data  values  are  correct  within  the  framework  of  the  whole 
database  as  a  model  representing  some  enterprise  (a  further 
discussion  can  be  found  in  Chapter  20  of  [DATE75]).  In  the 
face  of  no  really  standard  terminology,  for  the  purposes  of 
this  thesis  (1)  above  will  be  referred  to  as  the  problem  of 
back-up  and  recovery  and  (2)  above  will  be  consistency.  These 
problems,  and  their  solutions,  tend  to  interact  in  practice 
and  the  terminology  is  often  intertwined  and  interchanged. 

A  distributed  DBMS  must  be  concerned  with  the  same 
integrity  problems  as  a  centralized  system  and  also  deal  with 
certain  complications  caused  by  the  distribution.  In
particular,  back-up  and  recovery  must  handle  a  number  of
separate  sites  connected  by  some  communications  media.  If
communications  fail,  for  example,  the  recovery  algorithms  may
not  be  able  to  get  to  the  back-up  data  they  need  or  even
decide  what  data  are  relevant.  In  addition,  consistency  must
handle  much  larger  (than  centralized)  communication  delays  and
multiple  copies  of  data.  This  may  mean  that  without  constant
overhead  communications  such  as  "I'm  okay,  are  you  okay?",  it
will  be  impossible  for  any  node  to  differentiate  between
failure  of  another  node  and  extra-long  communications  delays.
If  timeout  mechanisms  are  used,  then  the  possibility  of
duplicate  messages  must  be  handled  (e.g.,  at  time-out  a
failure  decision  is  reached  and  a  retry  is  requested,  and  then
the  original  result  shows  up).  Simple  extension  of
centralized  database  control  strategies,  such  as  locking  data
items  to  provide  mutual  exclusion  or  using  pointers  instead  of
redundancy,  may  not  be  appropriate  across  physically  separate
computer  systems  which  may  also  be  geographically  dispersed.
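The duplicate-message case mentioned above (a retry is issued at time-out and the original result then arrives as well) can be illustrated with a small sketch. It assumes that every request carries a unique identifier and that the requester remembers which identifiers it has already answered; both the identifier scheme and the class name are hypothetical and are not taken from any of the systems surveyed here.

```python
class ReplyFilter:
    """Hypothetical requester-side filter that discards duplicate replies."""

    def __init__(self):
        self.answered = set()   # request identifiers already satisfied

    def accept(self, request_id, result):
        """Return the result the first time an identifier is seen, else None."""
        if request_id in self.answered:
            return None         # late original or duplicate retry: ignore it
        self.answered.add(request_id)
        return result


replies = ReplyFilter()
print(replies.accept(7, "balance=100"))   # reply to the retry arrives first
print(replies.accept(7, "balance=100"))   # the original result shows up later
```

Such a filter only handles duplicate replies; it does not by itself tell a node whether its partner has failed or is merely slow.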

For  example,  distributed  systems  without  centralized 
control  deliberately  give  up  the  ability  to  centrally  control 
the  total  system  state.  Since  the  nodes  of  the  system
cooperate  in  control  and  since  communication  delays  between
any  pair  of  nodes  are  unpredictable,  there  is  no  time
reference  common  among  the  nodes  and  no  global  state
information  which  can  be  collected  that  reflects  the  state  of
the  system  at  a  given  time  [LELA77].  This  complicates
particular  database  management  strategies  commonly  relied  on
for  control  in  centralized  systems.  Centralized  control  of
locking,  for  example,  is  often  used  to  prevent  interfering
accesses  to  data  items  being  referred  to  by  multiple  users.
When  lock  control  is  distributed  among  dispersed  nodes,  the
communications  delays  in  transferring  status  information  make
it  impossible  for  any  node  to  collect  the  simultaneously
accurate  global  state  data  needed  to  detect  or  resolve
deadlock.

Another  problem  is  that  no  two  nodes  of  a  general 
distributed  system  can  be  proven  [LELA77]  to  identically
observe  the  same  sequence  of  events  (e.g.,  internal  state
changes).  The  events  are  observable  only  through  the  external
communications  generated,  and  the  delay  time  involved  in
receiving  such  communications  is  comparable  to  the  time
between  the  successive  events.  If  observations  cannot  even  be
proven  the  same  for  a  given  event  sequence,  it  will  be
difficult  to  ensure  the  same  execution  sequences  (as  for
update  applications)  at  different  nodes.  This  is  even  further
complicated,  since  no  two  nodes  can  be  proven  [LELA77]  to  have 
the  same  global  view  of  even  a  particular  system  subset. 

A.2.1  Back-up  and  Recovery

A  variety  of  back-up  and  recovery  techniques  (for
centralized  systems)  have  been  discussed  in  [VERH78]  from  the
point  of  view  that  a  failure  is  "an  event  at  which  the  system
does  not  perform  according  to  specifications"  and  where  an 
error  is  "a  piece  of  information  which  can  cause  a  failure." 
The  object  of  the  recovery  techniques  is  to  restore  the 
database  to  a  state  which  is  useable  within  the  system,  by 
using  back-up  information  which  has  been  collected  during 
system  operation. 

In  a  distributed  system,  the  previously  mentioned 
communication  delays  and  lack  of  global  state  information 
immediately  complicate  recovery  solutions.  If  a  single  node 
fails  and  the  rest  of  the  system  continues  to  operate,  there 
must  be  a  way  for  a  recovering  node  to  catch  up  on  everything 
that  has  happened  since  it  failed.  The  back-up  information 
must  have  been  kept  in  a  way  and  in  places  that  facilitate 
such  recovery. 

Crash  recovery  has  been  looked  at  by  Lampson  and  Sturgis 
[LAMB76]  in  the  context  of  "atomic  transactions."  A 
transaction  is  simply  a  user-designated  sequence  of  read  and 
write  commands  to  be  sent  to  the  file  system.  The  atomic 
property  for  transactions  is  aimed  at  ensuring  that  after 
recovery  from  a  system  crash,  either  all  of  the  writes  have 
been  executed  or  none  have.  The  technique  for  accomplishing 
this  assumes  that  the  processes  which  write  on  the  system 
storage  media  can  guarantee  only  three  possible  outcomes  for 
any  write  action:

1.  it  is  not  performed  at  all,

2.  it  is  performed  incorrectly  and  the  error  is
detectable,  or

3.  it  is  performed  completely  and  correctly.

An  "intentions  list"  records  in  non-volatile  storage  the 
actions  necessary  to  carry  out  the  write  commands  of  a 
transaction.  Following  a  crash,  the  recovery  technique  will
examine  all  intentions  lists,  erasing  completed  ones  and
carrying  out  the  actions  specified  in  the  rest.  Each  list  is
deleted  when  all  of  its  actions  are  completed.  Within  such  a
framework,  Lampson  and  Sturgis  sketch  a  proof  that  three-state
locks  provide  proper  operation  of  a  multiple-node  system  in
the  face  of  possible  crashes  and  recoveries.
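The recovery step described above can be sketched roughly as follows. This is a single-node illustration only, not Lampson and Sturgis's algorithm: it assumes that the recorded writes are idempotent (repeating one is harmless), it models non-volatile storage with ordinary dictionaries, and it simulates a crash with a hypothetical `crash_after` argument.

```python
class Store:
    """Toy model of a file system with intentions lists (all names hypothetical)."""

    def __init__(self, data):
        self.data = dict(data)   # stands in for non-volatile data storage
        self.intentions = {}     # stands in for non-volatile intentions lists

    def record_intentions(self, txn_id, writes):
        """Record every write of the transaction before any is carried out."""
        self.intentions[txn_id] = dict(writes)

    def carry_out(self, txn_id, crash_after=None):
        """Apply the listed writes; optionally 'crash' after some of them."""
        for count, (item, value) in enumerate(self.intentions[txn_id].items()):
            if crash_after is not None and count >= crash_after:
                return False             # simulated crash: the list survives
            self.data[item] = value      # idempotent write
        del self.intentions[txn_id]      # list deleted once all actions are done
        return True

    def recover(self):
        """After a crash, finish every intentions list that still exists."""
        for txn_id in list(self.intentions):
            self.carry_out(txn_id)


store = Store({"A": 1, "B": 2})
store.record_intentions("t1", {"A": 10, "B": 20})
store.carry_out("t1", crash_after=1)   # only the write to A happens
store.recover()
print(store.data)                      # both writes are now in place
```

A transaction that crashes before its list is recorded leaves no writes at all in this model, which gives the all-or-none behavior for the recorded case.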

Three-state  locking  has  also  been  proposed  to  handle 
mutual  consistency  among  multiple  copies  of  data,  without 
deadlock.  Peebles  has  pointed  out,  however  [PEEB77],  that 
while  the  solutions  are  deadlock-free  for  multiple  copies  of  a 
single  data  item,  they  do  not  generalize  to  multiple-item
updates  whose  write-sets  overlap.  For  example,  if  transaction 
1  required  updating  values  of  data  items  A,  B,  C,  and  D  while 
transaction  2  required  updating  B,  C,  and  E,  the  three-state 
lock  solutions  so  far  proposed  are  not  adequate.  This  general 
case  would  require  an  accompanying  scheme  to  deal  with 
deadlock.

A.2.2  Consistency

The  problem  of  consistency  can  be  further  divided  into 
two  parts:  (1)  checking  the  accuracy  of  every  data  item  value 
entered  into  the  database,  and  (2)  maintaining  that  accuracy 
in  the  face  of  multiple,  concurrent  accesses  involving  updates 
which  may  be  potentially  contradictory  or  conflicting. 

A.2.2.1  Accuracy.

Accuracy  checking  is  usually  handled  from  the  point  of 
view  of  plausibility  rather  than  trying  to  check  the  value  of 
each  item  in  relation  to  all  other  items  in  the  database. 
Such  plausibility  criteria  can  be  established  through 
consistency  (sometimes  called  integrity)  constraints  which 
usually  amount  to  specification  of  some  set  properties.  For 
example,  in  a  banking  system,  deposits  should  have  values 
greater  than  zero  and  should  be  accepted  only  for  accounts 
known  to  exist,  and  the  day's  total  balance  should  equal  the 
prior  balance  plus  deposits  less  withdrawals.  Cancelling 
errors,  such  as  can  occur  in  sums  for  example,  will  not  be 
detectable  in  this  way. 
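The banking constraints above can be written as simple plausibility checks. The sketch below uses integer amounts and hypothetical function names; it also shows why cancelling errors escape detection, since two wrongly recorded deposits with the correct total still satisfy the balance constraint.

```python
def deposit_is_plausible(amount, account, known_accounts):
    """A deposit must be positive and must name an existing account."""
    return amount > 0 and account in known_accounts


def day_balances(prior, deposits, withdrawals, closing):
    """The closing balance must equal prior plus deposits less withdrawals."""
    return closing == prior + sum(deposits) - sum(withdrawals)


accounts = {"acct-1", "acct-2"}
print(deposit_is_plausible(50, "acct-1", accounts))    # accepted
print(deposit_is_plausible(-5, "acct-1", accounts))    # rejected: not positive
print(deposit_is_plausible(50, "acct-9", accounts))    # rejected: unknown account

# True deposits were 150 and 150, but 100 and 200 were recorded.
# The errors cancel in the sum, so the constraint is still satisfied.
print(day_balances(1000, [100, 200], [50], 1250))
```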

Consistency  constraints  are  still  probably  the 
appropriate  way  to  handle  the  accuracy  problems  in  a  DDBMS  as 
well  as  in  a  centralized  system.  Some  additional 
considerations  involve:  (1)  needing  to  store  constraints
along  with  data,  (2)  keeping  multiple  copies  of  constraints
consistent  as  well  as  copies  of  data,  and  (3)  handling
partitioned  data  where  constraints  apply  to  data  aggregates 
which  span  machines.  There  does  not  yet  seem  to  be  any 
literature  which  specifically  addresses  these  issues. 

A.2.2.2  Concurrent  Access  Control.

Concurrent  access  control  is  a  problem  whenever  multiple 
users  are  allowed  to  work  simultaneously  on  a  database. 
Particularly  when  a  database  transaction  can  involve  more  than 
a  single  data  item,  it  is  important  to  prevent  interfering 
accesses  by  simultaneous  transactions  involving  updates.  In  a 
distributed  database  context,  we  can  call  this  "internal"
consistency  to  show  it  refers  to  non-redundant  data  within  a
single-copy  database.  For  example,  consider  two  database
users,  A  and  B,  both  accessing  a  payroll  file.  A's  job  is  to
give  all  employees  in  department  10  a  5%  raise  in  salary.  B's
job  is  to  give  all  employees  with  salaries  less  than  $5000  a 
$500  raise.  If  C  and  D  were  in  department  10  and  had  salaries
of  $4800,  the  new  salaries  could  be  either  $5040  (if  A  updated
first)  or  $5565  (if  B  updated  first,  since  A's  5%  raise  would
then  apply  to  the  new  $5300  salary).  Concurrent  updating  by  A
and  B  could
end  up  with  C  and  D  getting  different  salaries  --  clearly  an 
undesirable,  and  probably  an  incorrect  result.  Even  if  only  C 
had  been  involved  in  the  update,  there  is  a  question  as  to 
which  result  was  actually  intended  for  C.  Internal 
consistency  is  usually  handled  by  locking  the  data  items  to  be 
updated  so  that  concurrent  accesses  are  prevented  and
operations  are  strictly  serialized.  The  order  is  usually
established  according  to  some  predefined  scheme  based  on 
mechanisms  such  as  timestamping  update  requests  or  assigning 
priorities  to  all  users.  Such  a  scheme  would  have  to 
represent  a  central  or  universally  agreed-upon  policy  to  be 
applied  uniformly  by  each  node. 
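The payroll example can be traced with a few lines of code. The sketch assumes that C and D both belong to department 10 and that salaries are whole dollars; the function names are hypothetical. Under these assumptions, running A's job before B's leaves a $4800 salary at $5040, while running B's job before A's compounds the two raises, since the 5% is then applied to $5300.

```python
def raise_dept_10(salary, dept):
    """User A's job: a 5% raise for employees in department 10."""
    return salary * 105 // 100 if dept == 10 else salary


def raise_low_paid(salary):
    """User B's job: a $500 raise for salaries under $5000."""
    return salary + 500 if salary < 5000 else salary


def apply_in_order(salary, dept, order):
    """Apply the two jobs to one employee in the given order, e.g. 'AB'."""
    for job in order:
        if job == "A":
            salary = raise_dept_10(salary, dept)
        else:
            salary = raise_low_paid(salary)
    return salary


# Unsynchronized interleaving: A reaches C's record first, B reaches D's first.
salary_c = apply_in_order(4800, 10, "AB")
salary_d = apply_in_order(4800, 10, "BA")
print(salary_c, salary_d)   # the two employees end up with different salaries
```

Locking both records for the duration of each job forces one of the two orders on both employees, so that C and D at least receive the same result.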

In  addition  to  internal  consistency,  a  DDBMS  must  further 
provide  "mutual"  consistency  among  any  copies  of  data 
partitions  or  an  entire  database.  That  is,  if  all  updating 
activity  stops,  the  copies  must  be  identical  after  some 
reasonable  transient  time.  In  general,  to  provide  mutual 
consistency  in  a  distributed  database,  it  has  been  assumed 
that  application  of  any  update  must  be  synchronized  among  all 
data  copies.  That  is,  updates  for  the  same  data  item  must  be 
ordered  so  that  their  application  will  not  give  different 
results  in  different  copies. 

The  various  approaches  to  concurrent  access  control  have 
been  stated  in  diverse  ways  and  with  widely  differing 
terminologies.  Performance  comparisons  of  the  algorithms  are 
just  starting  to  be  done  [GARC78]  and  formalisms  for 
comparison  are  beginning  to  develop  [GELE78  and  BERN78]. 
LeLann  has  suggested  [LELA78]  that  one  way  to  group  the 
research  on  solutions  is  according  to  the  underlying 
mechanisms:

•  physical  time,

•  logical  time  or  timestamps,

•  explicit  control  privilege,

•  event  counts  or  sequence  identifiers.

These  categories  provide  a  convenient  outline  for  our 
discussions  here,  and  we  will  discuss  one  example  for  each. 

A.2.2.2.1  Physical  Time.

Lamport  has  shown  that  mutual  exclusion  in  the  use  of 
distributed  resources  can  be  achieved  using  physical  clocks 
which  satisfy  certain  conditions  (discussed  in  the  next 
paragraph)  [LAMP78].  The  basis  of  this  approach  is  that  the 
sending  and  receiving  of  messages  in  a  distributed  system 
determine  a  unique  partial  ordering  of  the  message  events. 
For  example,  within  a  single  node,  the  events  are  ordered  by 
their  times  of  occurrence,  so  that  if  event  A  occurred  before 
event  B  we  have  the  logical  relation:  A  "happened  before"  B. 
Between  nodes  the  ordering  of  events  can  be  established  only 
from  physically  related  events  such  as  A  representing  the 
sending  of  a  particular  message  by  one  node  and  B  representing 
the  receipt  of  that  same  message  by  another  node.  In  this 
case,  the  two  events  are  also  logically  related  by:  A
"happened  before"  B.  Within  the  entire  distributed  system,
"happened  before"  is  only  a  partial  ordering  since  there  will 
be  sets  of  events  which  are  not  logically  related.  The 
partial  ordering  can  be  extended  to  a  total  ordering  by  adding 
a  rule  to  tell  which  of  two  unrelated  (also  called  concurrent
if  they  happen  between  sets  of  related  events)  events  came
first.  Any  such  total  ordering  effectively  eliminates 
concurrent  accesses;  Lamport  introduces  physical  clocks  so 
that  the  rule  and  the  resultant  total  ordering  will  give  a 
serialization  to  the  events  that  is  appropriate  as  far  as  some 
(hypothetical)  universal  observer  who  sees  the  operation  of 
the  entire  system  is  concerned. 

The  physical  clocks  are  assumed  to  run  at  approximately 
the  correct  rate  (i.e.,  different  from  the  true  rate  only  by 
some  fraction  kappa  <<  1),  and  to  be  sufficiently  well
synchronized  (i.e.,  within  some  small  epsilon  of  each  other).
If  mu  is  a  quantity  that  is  less  than  the  shortest  transmission
time  for  interprocess  (internode)  messages,  and  if

    epsilon / (1 - kappa) <= mu

is  true,  then  anomalous  behavior  is  impossible.  Within  these
assumptions  (the  certain  conditions  mentioned  above),  the
implementation  rules  for  the  clocks  to  provide  the  total
ordering  of  events  are:  the  clock  rates  are  continuous
between  events,  messages  are  timestamped  by  their  senders,  and
message  receivers  reset  their  clocks  to  the  sum  of  the 
timestamp  and  the  minimum  message  delay  (mu),  if  that  sum  is 
greater  than  their  local  clock  time. 
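The clock rules and the condition on epsilon, kappa, and mu can be sketched as follows. The class and function names are hypothetical, clock readings are plain numbers, and the passage of real time is supplied by the caller; this is an illustration of the three rules as stated above, not a reproduction of Lamport's proofs.

```python
class PhysicalClock:
    """Hypothetical node clock following the three rules stated above."""

    def __init__(self, mu, start=0.0):
        self.mu = mu        # assumed lower bound on message transmission time
        self.time = start

    def advance(self, elapsed):
        """The clock runs continuously between events."""
        self.time += elapsed

    def send(self):
        """Messages are timestamped by their senders."""
        return self.time

    def receive(self, timestamp):
        """Reset to timestamp + mu if that is later than the local reading."""
        self.time = max(self.time, timestamp + self.mu)


def anomaly_free(epsilon, kappa, mu):
    """Check the condition epsilon / (1 - kappa) <= mu."""
    return epsilon / (1 - kappa) <= mu


fast = PhysicalClock(mu=0.5, start=10.0)
slow = PhysicalClock(mu=0.5, start=3.0)
slow.receive(fast.send())
print(slow.time)                        # the slow clock has been moved forward
print(anomaly_free(0.01, 0.001, 0.05))
```

A clock is only ever moved forward by these rules, so the ordering of events already observed at a node is never disturbed.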

The  main  reasons  for  discussing  this  physical  time  scheme 
are  that  it  is  a  formal  treatment  with  proofs  and  that  it  lays 
the  groundwork  on  which  the  next  category  of  research,  logical
time  or  timestamps,  is  built.

A.2.2.2.2  Logical  Time.


Logical  time,  in  the  form  of  timestamps,  was  probably 
first  introduced  into  distributed  database  systems  by  Johnson 
and  Thomas  [JOHN75].  Their  timestamp  consisted  of  a  local 
time  (from  a  physical  clock)  concatenated  with  a  site 
identification,  and  was  used  to  resolve  the  order  of 
application  of  conflicting  updates  for  multiple  copies  of 
data.  One  timestamp  was  said  to  precede  another  if  the  first 
time  preceded  the  second,  or  if  the  times  were  equal  and  the 
first  node  ID  was  smaller  than  the  second. 
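The precedence rule for these timestamps amounts to a lexicographic comparison of the pair (time, node ID). A minimal sketch, with hypothetical type and field names and integer clock readings assumed:

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    clock_time: int   # local physical clock reading at the originating site
    node_id: int      # site identification, used only to break ties


def precedes(first: Timestamp, second: Timestamp) -> bool:
    """True if `first` precedes `second` under the rule stated above."""
    if first.clock_time != second.clock_time:
        return first.clock_time < second.clock_time
    return first.node_id < second.node_id


print(precedes(Timestamp(100, 3), Timestamp(101, 1)))   # earlier time wins
print(precedes(Timestamp(100, 1), Timestamp(100, 3)))   # tie broken by node ID
```

Because node IDs are unique, any two timestamps from different sites are strictly ordered, which is what allows conflicting updates to be applied in the same order at every copy.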


Thomas  went  on  to  use  timestamps  to  completely  coordinate 
multiple-copy  distributed  database  updating  in  the  majority 
consensus  algorithm  [THOM76].  In  his  scheme,  each  item  in  the 
database  has  associated  with  it  a  timestamp  which  represents 
the  last  time  that  item's  value  was  updated.  Update  requests 
are  assigned  their  own  timestamps  at  origination  and  consist 
of  a  list  of  the  items  to  be  updated  and  a  list  of  the  base 
items  (with  timestamps)  on  which  the  update  is  predicated. 
The  database  manager  in  each  node  of  the  distributed  system 
compares  the  base  item  timestamps  in  the  request  with  the 
items  in  his  own  copy  of  the  database.  If  any  of  the  request 
timestamps  are  obsolete  (i.e.,  older  than  the  local  copy 
timestamps),  the  update  request  is  rejected  (and  may  be  saved
for  later  resubmission).  If  the  timestamps  are  current  and
the  request  does  not  conflict  with  any  others  pending  at  that 
site,  the  vote  is  to  accept  and  the  request  becomes  pending  at 
that  site.  A  conflict  occurs  whenever  the  update  items  of  one 
request  overlap  with  the  base  items  of  another  request.  The 
conflicts  are  handled  by  assigning  a  priority  order  to  the 
nodes  of  the  network,  and  giving  each  update  request  the 
priority  of  its  originating  node.  Requests  conflicting  with 
another  of  higher  priority  are  rejected;  requests  conflicting 
with  another  of  lower  priority  are  set  aside  and  voting  is 
deferred.  The  originating  node  collects  the  votes  on  its 
update  request;  a  majority  consensus  on  acceptance  allows 
update  application  in  parallel  throughout  the  system.  Thomas 
showed  that  any  rejection  vote  will  ensure  that  a  majority 
vote  for  acceptance  cannot  be  achieved. 
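The voting rule applied at each node can be sketched as below. This is a simplified illustration of the description above, not Thomas's full algorithm: timestamps are plain integers, a larger priority number is assumed to mean higher priority, and message passing, resubmission, and the handling of deferred votes are all omitted. The names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UpdateRequest:
    origin_priority: int   # priority of the originating node (larger = higher)
    base: dict             # item -> timestamp of the value the update relied on
    writes: dict           # item -> new value


def conflicts(r1, r2):
    """Update items of one request overlap the base items of the other."""
    return bool(set(r1.writes) & set(r2.base)) or bool(set(r2.writes) & set(r1.base))


class Node:
    def __init__(self, timestamps):
        self.timestamps = dict(timestamps)   # item -> timestamp of last update
        self.pending = []                    # requests this node voted to accept

    def vote(self, request):
        # Reject if any base item is older than this node's copy.
        for item, stamp in request.base.items():
            if stamp < self.timestamps.get(item, 0):
                return "reject"
        rivals = [p for p in self.pending if conflicts(request, p)]
        if any(p.origin_priority > request.origin_priority for p in rivals):
            return "reject"
        if rivals:
            return "defer"                   # set aside; vote later
        self.pending.append(request)
        return "accept"


def majority_accepts(votes, node_count):
    """The originator may apply the update once most nodes vote to accept."""
    return votes.count("accept") > node_count // 2


node = Node({"x": 5, "y": 5})
print(node.vote(UpdateRequest(2, {"x": 5}, {"y": 1})))   # current, no rivals
print(node.vote(UpdateRequest(1, {"y": 5}, {"x": 9})))   # loses to higher priority
```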

The  major  difference  between  Thomas's  and  Lamport's  use 
of  timestamps  is  in  the  use  of  the  physical  clocks.  Lamport 
continually  resets  the  clocks  to  maintain  synchronization  as 
closely  as  possible.  Instead,  Thomas  prevents  sequencing 
anomalies  by  having  the  update  timestamp  be  assigned  by  the 
requesting  node  as  the  maximum  value  of  either  the  local  clock 
time  or  the  latest  timestamp  of  any  of  the  base  variables  plus 
one.  Thus  the  clocks  can  run  at  different  rates,  without  any 
synchronization,  and  are  only  limited  by  the  restriction  that 
they  never  be  set  backwards. 
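Thomas's timestamp assignment can be written in one line. The sketch reads the rule as the maximum of the local clock time and one more than the latest base timestamp, which is an interpretation of the wording above; integer clock values and the function name are likewise assumptions.

```python
def assign_update_timestamp(local_clock, base_timestamps):
    """Timestamp for a new update request at its originating node."""
    latest_base = max(base_timestamps, default=0)
    return max(local_clock, latest_base + 1)


print(assign_update_timestamp(10, [3, 7]))   # local clock is already ahead
print(assign_update_timestamp(5, [3, 9]))    # slow clock: pushed past the base
```

The result is always later than every base timestamp, so an update can never appear to precede the values it was computed from, however slowly the local clock runs.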


A  variety  of  other  concurrent  access  control  algorithms 
based  on  timestamps  have  been  proposed  [ROSE78,  BADA78, 
GELE78].  One  of  these  is  SDD-1  [ROTH77],  a  DDBMS  being 
designed  and  built  by  Computer  Corporation  of  America,  which 
is  discussed  in  the  section  on  Implementations. 


A.2.2.2.3  Explicit  Control  Privilege.

A  different  approach  to  concurrency  is  based  on  control 
being  explicitly  assigned  to  a  particular  node.  This  would, 
in  effect,  be  a  DDBMS  with  distributed  data  and  centralized 
control.

Alsberg  and  Day  have  suggested  a  fault-tolerant  approach
to  explicit  control,  called  n-host  resiliency  [ALSB76],  where
n  hosts  (n  >  0,  n  <=  total  number  of  hosts)  must  be  aware
of  an  update  by  receiving  the  request  before  either 
acknowledgement  to  the  user  or  application  to  any  local 
database  copy.  The  point  is  that  only  one  node  is  involved  in 
conflict  resolution  or  update  sequencing  (probably  the 
simplest  possible  solution)  and  the  goal  is  to  provide 
reliable  service  (or  be  resilient)  in  the  face  of  crashes 
(i.e.,  avoid  dependence  on  a  critical  resource).  To  achieve 
this,  the  network  hosts  are  divided  into  three  groups: 

1.  users,  which  receive  queries  and  updates  from  the
external  world,

2.  a  unique  primary,  and

3.  back-ups,  which  are  explicitly  linearly  ordered  in
some  arbitrary  prearranged  manner.

Only  the  primary  and  back-ups  have  database  copies.  All 
updates  are  sent  to  the  primary,  who  resolves  any  conflicts, 
applies  the  update  to  its  local  data  copy  and  starts  the 
request  down  the  back-up  chain.  Each  back-up  acknowledges 
receipt  of  the  update  request  from  its  predecessor  and  sends 
it  on  down  the  chain.  The  n-th  node  to  apply  the  update 
(i.e.,  the  (n-1)st  back-up)  returns  the  acknowledgement  to  the
update  originator,  and  the  user  can  then  proceed  while 
propagation  of  the  change  through  the  back-up  chain  completes. 
The  acknowledgement  scheme  supplies  the  n-host  resiliency,  by 
ensuring  updates  are  never  acknowledged  to  a  user  as  complete 
until  n  hosts  know  about  them.  In  this  way,  a  crash  of  the 
primary  or  any  other  node  does  not  cripple  the  system. 
Time-outs  are  used  on  acknowledgement  expectations  to  detect 
any  host  failures.  If  a  back-up  fails,  the  chain  is 
reestablished  by  linking  the  failed  node’s  predecessor  and 
successor.  If  this  is  not  possible,  a  network  partition 
occurs  and  consistency  problems  are  unavoidable  in  the  general 
case.  If  the  primary  fails,  the  back-ups  will  select  a  new 
primary.  The  point  is  that  in  general,  operation  is 
maintained  despite  failures  unless  fewer  than  n  hosts  remain. 
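
The acknowledgement rule can be illustrated with a minimal
Python sketch. The names Host and propagate and the in-memory
chain are assumptions made for illustration only; they are not
taken from [ALSB76]. Time-outs, host failures, and the
overlap between user progress and continued propagation are
not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    applied: list = field(default_factory=list)  # updates applied to the local copy


def propagate(chain, update, n):
    """Send `update` down the chain (primary first, then back-ups in order).

    Returns the name of the host that acknowledges to the user: the n-th
    host to apply the update, i.e. the (n-1)st back-up.
    """
    if not 0 < n <= len(chain):
        raise ValueError("n must satisfy 0 < n <= number of hosts holding copies")
    acknowledger = None
    for position, host in enumerate(chain, start=1):
        host.applied.append(update)
        if position == n:
            # From this point on the user may proceed; the loop simply
            # continues to model propagation through the rest of the chain.
            acknowledger = host.name
    return acknowledger


chain = [Host("primary"), Host("backup1"), Host("backup2")]
acker = propagate(chain, "set x = 5", n=2)
```

With n = 2, the first back-up is the host that acknowledges,
while every host in the chain still ends up with the update.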


A  variation  on  this  scheme  has  been  proposed  which  allows 
broadcast  of  updates  instead  of  linear  propagation  [GRAP76]. 
A  similar  scheme  has  been  used  by  Stonebraker  [STON76]  for  a 
distributed  version  of  INGRES,  which  is  discussed  in  the 
section  on  Implementations. 

A.2.2.2.4  Sequence  Identifiers. 

The  fourth  category  of  research  into  concurrency  control 
uses  event  counts  or  sequence  identifiers  to  implement 
distributed  control  over  a  distributed  database  without 
referring  to  either  real  clocks  or  timestamps. 

LeLann has suggested [LELA78] explicitly sequencing
update  requests  by  use  of  tickets  similar  to  ones  used  in  a 
bakery  to  assign  service  order  to  the  customers.  The  ticket 
order  ensures  that  updates  are  processed  in  exactly  the  same 
order  throughout  the  distributed  system.  To  eliminate 
concurrency  in  accessing  (as  well  as  to  avoid  deadlock  over 
data  resources)  and  still  have  distributed  control  over  the 
DDB,  the  database  hosts  of  the  system  are  arranged  in  a 
virtual  ring  by  assigning  permanent  identification  numbers  in 
a  sequential  increasing  fashion  around  the  ring  so  that  a 
predecessor  and  successor  are  defined  for  each  node.  A 
special  message  called  the  control  token  is  circulated  around 
the  virtual  ring  to  implement  the  ticketing  facility.  Tickets 
can be taken from the "dispenser" (the control token) only by
the node owning the token and are assigned to pending update
requests  only  after  the  control  token  has  been  accepted  by 
that  node's  ring  successor.  Thus  when  an  update  request 
arrives  at  a  particular  node,  it  is  assigned  the  next  ticket 
that  the  node  has  available.  If  there  are  no  tickets  left, 
the  request  must  wait  in  a  pending  queue  until  the  control 
token arrives (back at that node) and the node can get more
tickets.

Ticketed  update  requests  are  sent  throughout  the  system. 
All  nodes  are  restricted  to  applying  updates  in  ticket  order, 
so  if  preceding  ticket  numbers  are  missing,  a  newly  arriving 
ticketed  update  must  be  queued  until  all  lower  ticket  numbers 
have been processed. The token effectively acts in place of a
centralized,  master  clock  to  provide  global  event  sequencing. 

LeLann  discusses  several  performance  variations  on  the 
basic  scheme  to  allow  increased  parallelism  within  the  system. 
One  way  involves  preallocation  of  tickets  for  updates  not  yet 
pending,  but  expected  before  the  control  token  returns.  This 
requires  that  when  the  control  token  does  return,  any 
left-over,  unused  tickets  must  be  flushed  by  sending  out  null 
updates.  Otherwise  one  unused  ticket  could  hold  up  all 
subsequently  assigned  ones  indefinitely. 

The particular advantage to the virtual ring with token
and tickets as a concurrency solution is that there are no
gaps in the ordering specified by the tickets the way there
are with timestamps. On receiving a ticketed update request,
a node always knows how many missing updates will have to be
received and processed before the current one, since it knows
the ticket number of the last one it did process.
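
The receiving side of the ticket discipline can be sketched in
a few lines of Python. The class and method names are
hypothetical, and token circulation, ticket allocation, and
null updates are left out; the sketch shows only how a node
holds back updates until all lower ticket numbers have been
processed, and how it can count what is still missing.

```python
class TicketOrderedNode:
    """One database host that applies updates strictly in ticket order."""

    def __init__(self):
        self.next_ticket = 1   # lowest ticket number not yet applied
        self.pending = {}      # ticket number -> update, held until its turn
        self.applied = []      # updates in the order they were applied

    def receive(self, ticket, update):
        self.pending[ticket] = update
        # Apply every queued update whose predecessors have all been processed.
        while self.next_ticket in self.pending:
            self.applied.append(self.pending.pop(self.next_ticket))
            self.next_ticket += 1

    def missing_before(self, ticket):
        """Count tickets below `ticket` that have not yet been received."""
        return sum(1 for t in range(self.next_ticket, ticket)
                   if t not in self.pending)


node = TicketOrderedNode()
node.receive(2, "b")
node.receive(3, "c")
held = list(node.applied)   # still empty: ticket 1 has not arrived
node.receive(1, "a")        # releases tickets 1, 2 and 3 in order
```

Because ticket numbers have no gaps, the count returned by
missing_before is exact, which is not possible with timestamps.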

A.2.2.2.5  Conclusion

This  section  has  presented  a  variety  of  solutions  for  the 
concurrent  access  control  problem.  The  sample  of  schemes  that 
has been presented is intended to point out the diversity of
approach; analysis of assumptions and performance will be
required to determine which solution is appropriate for a
particular DDB system. Such analysis is just beginning to be
done  so  that  comparisons  can  be  made  [GARC78,  BEHN78]. 

A.3  SECURITY

The  issue  of  security  in  distributed  database  management 
systems  has  not  yet  received  a  great  deal  of  attention.  The 
problems  will  almost  certainly  be  at  least  as  large  as  the  sum 
of  the  problems  of  protecting  the  data  stored  within  each 
node's  piece  of  the  database  (internal  security)  and 
maintaining  the  security  of  communications  between  nodes. 
Internal  security  includes  such  problems  as:  verification  of 
user  identities,  control  of  user  access  to  data,  control  of 
user  operations  on  data,  and  actual  physical  security  of  the 
computers and peripherals. It is becoming less clear that
even these are the same as for centralized systems. A
particular example coming into focus as networks begin to
develop general access mechanisms and routing schemes which
involve multiple nodes is the difficulty involved in
determining an individual user's location or network entry
method. Even if he is asked, there is no way to validate his
answer.

Encryption  techniques  will  probably  be  applicable  to 
communications  security  between  nodes  in  the  network,  although 
some  protection  will  be  needed  against  possible  sabotage  by 
insertion  of  garbage  streams  into  the  communications  subnet. 
If  a  user's  location  or  entry  method  (e.g.,  local  phone, 
long-distance  phone,  secured  line)  cannot  be  verified, 
however,  encryption  may  be  needed  even  internally. 

Capability  mechanisms  [DENN76]  provide  one  approach  to 
centralized  security  which  may  be  adaptable  to  distributed 
databases.  Timestamps  consisting  of  a  node  number  appended  to 
a  local  clock  time  can  be  used  as  unique  (within  an  entire 
distributed  system)  identifiers  if  they  are  associated  with 
data  items  upon  entry  into  a  database  or  upon  value 
modifications  (updates).  Reed  suggests  [REED78]  that  updated 
values  be  considered  as  a  series  of  timestamped  versions  of  an 
original  data  item,  and  he  keeps  the  relationship  among  the 
versions  explicitly  as  a  history  associated  with  the  original. 
Wyleczuk has proposed [WYLE79] a combination of capabilities
and versions to provide security and control access in a
distributed  environment.  The  use  of  versions  connected  by  a 
history  creates  a  context  in  which  timestamps  are  naturally 
associated  with  every  data  item  in  the  database.  The 
timestamps  can  then  be  used  both  to  identify  particular  items 
and  to  limit  the  effectiveness  of  the  access  privilege  granted 
by  a  capability.  Revocation  of  access  privilege,  for  example, 
can  be  made  automatic  by  including  in  a  capability  an 
expiration  time  for  its  effectiveness. 
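
A minimal Python sketch of a capability with a built-in
expiration time follows. The field names, the form of the item
identifier, and the permits function are assumptions chosen
for illustration; they are not the mechanism of [DENN76] or
[WYLE79].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    item_id: str        # identifier of the protected data item (hypothetical form)
    rights: frozenset   # operations granted, e.g. frozenset({"read"})
    expires_at: float   # time after which the capability is no longer effective


def permits(cap, item_id, right, now):
    """True if `cap` grants `right` on `item_id` at time `now`."""
    return (cap.item_id == item_id
            and right in cap.rights
            and now < cap.expires_at)


cap = Capability("node3:1042.7", frozenset({"read"}), expires_at=2000.0)
before_expiry = permits(cap, "node3:1042.7", "read", now=1500.0)
after_expiry = permits(cap, "node3:1042.7", "read", now=2500.0)
```

No explicit revocation message is needed in this sketch: once
the current time passes expires_at, every check fails.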

A.4  RELIABILITY

One  important  motivation  for  distributed  database  systems 
is  the  potential  for  improvements  in  availability  and 
reliability  that  is  created  by  having  multiple  copies  of  data 
and  cooperating  multiple  sites  of  data  control.  Here, 
availability  refers  to  the  user's  view  of  whether  the  system 
is  functioning  so  that  it  can  accomplish  his  job. 

Reliability,  on  the  other  hand,  begins  with  availability  and 
adds  concern  for  being  able  to  continue  operation  even  if  some 
failure  occurs.  Centralized  systems  are  often  handicapped  by 
failures  in  critical  resources;  a  DDBMS  could  be  more  reliable 
by  having  few,  if  any,  resources  of  a  critical  nature.  For 
example,  if  one  node  becomes  inoperable,  another  might  be  able 
to  take  over  for  it  if  the  data  required  to  handle  the 
inoperable  node’s  transactions  are  stored  at  both  places. 
This is useful only insofar as the database integrity is
preserved when the failed node rejoins the distributed system.
So  back-up  and  recovery  are  as  much  a  part  of  reliability  as 
of  integrity. 

Multiple  copies  of  data  could  also  improve  response  time 
by providing quicker access to closer copies where
communications delays would be shorter. Quicker response
tends  to  make  the  system  more  available  to  (large  numbers  of) 
users  by  not  requiring  them  to  wait  for  the  service  they  seek. 

Increased  reliability  does  not  come  for  free,  however. 
Keeping  duplicates  of  any  large  amount  of  data  will  be  costly 
in  terms  of  memory  and  in  terms  of  processing  needed  for 
consistency  maintenance.  This  will  have  to  be  balanced 
against  the  costs  of  unreliability  (delayed  access  or  lost 
data). Tradeoffs will be necessary to determine how many
copies of what information and programs are to be kept at what
locations [MORG77]. Copies of data local to accessing
programs can eliminate immediate communications delays but can
cost update synchronization overhead. The amount of
appropriate redundancy may well depend on the size of the
database in a particular situation as well as on the patterns
of data and program usage. Maintaining complete duplicate
databases at each node will at least be too costly (in terms
of storage space and restructuring time) for very large
databases, those on the order of  bits [GERR77].

In general, updates to a database are considered to
involve  replacement  or  rewriting  of  the  data  items  to  be 
changed. More recently, M.E. Senko [SENK7Y] has suggested
that  a  database  should  consist  of  all  data  values  entered, 
with  each  data  item  consisting  of  a  data  value  and  a 
timestamp.  Such  a  database  fits  very  well  with  certain 
back-up techniques such as journaling (recording all
transactions against the database with identifying
information) and with representation techniques such as
differential files (localization of database modifications in
a storage area small relative to the entire database
[SEVE76]). In this way updates are stored separately from the
main archival file until there are enough updates to make
applying them, and reorganizing or restructuring the storage
for the main file, cost-effective.
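
The differential file idea can be made concrete with a small
Python sketch. The class name, the dictionary representation,
and the merge threshold are all assumptions for illustration;
[SEVE76] should be consulted for the actual technique.

```python
class DifferentialStore:
    """Main file plus a small differential area for recent modifications."""

    def __init__(self, main, merge_threshold):
        self.main = dict(main)              # large, rarely reorganized main file
        self.diff = {}                      # small area holding recent updates
        self.merge_threshold = merge_threshold

    def update(self, key, value):
        self.diff[key] = value
        # Reorganize only once enough updates have accumulated.
        if len(self.diff) >= self.merge_threshold:
            self.merge()

    def read(self, key):
        # The differential area always holds the most recent value.
        return self.diff[key] if key in self.diff else self.main[key]

    def merge(self):
        self.main.update(self.diff)
        self.diff.clear()


store = DifferentialStore({"a": 1, "b": 2}, merge_threshold=2)
store.update("a", 10)
main_untouched = store.main["a"]   # main file still holds the old value
latest = store.read("a")           # reads see the new value
store.update("b", 20)              # second update triggers the merge
```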

The  next  two  sections  represent  the  state  of  the  art  in 
DDB  research  based  specifically  on  reliability  considerations. 


A.4.1  Sunshine's Model

Sunshine  has  looked  at  developing  an  availability  model 
for  comparing  the  overall  performance  of  a  uniprocessor  and 
several  distributed  processing  system  interconnection  schemes 
[SUNS77].  His  approach  also  supports  the  idea  of  multiple 
data  copies.  Availability  is  taken  to  be  measurable  by  the 
probability that at least one processing element is
functioning so that it can accomplish work for a user and by
the  average  amount  of  resources  available  to  do  the  work.  The 
cost  of  having  redundancy  among  processors,  programs,  and  data 
is  included  in  the  model. 

The  four  configurations  which  were  compared  are:  a  single 
processor, N standby processors, N separate processors, and N
interconnected processors. To make the total processing power
the same for all candidates, each processing element in the
systems of N processors is assumed to have 1/N times the
processing  power  of  the  uniprocessor  system.  The  results  show 
that  the  amount  of  time  the  system  is  available  can  be 
improved  over  the  uniprocessor  case  by  using  N  standby  or 
interconnected  processors.  Of  these  latter  two,  the 
interconnected  scheme  provides  the  greater  average  processing 
capacity  (although  both  are  less  than  the  uniprocessor).  Thus 
the  redundancy  required  to  achieve  greater  available  time 
costs  processing  capacity.  Further  study  is  needed  to  examine 
the  tradeoffs  involved  in  different  interconnection  schemes. 

The  model  is  simple  since  it  is  a  first  attempt  to 
formalize  measures  for  reliability. 

A.4.2  The  National  Software  Works

One  complete  system  which  has  been  designed  from  the 
point  of  view  of  reliability  is  the  National  Software  Works 
(NSW), described in [SHAP77]. The goal of NSW is to provide
through the ARPAnet a set of software development tools and a
host-independent user interface to them. The monitor portion
of NSW, the Works Manager, requires a synchronized,
duplicate-copy  distributed  database  to  hold  its  access 
control,  accounting,  and  auditing  information,  and  its  file 
catalog.  The  design  of  NSW  is  based  on  the  ideas  that  a 
system  is  reliable  if: 

1. it is available for use,

2. all actions reported to a user as complete are
correctly reflected in any future state of the system,
and

3.  a  user  can  probe  the  system  for  the  status  of  any 
action  he  has  initiated  and  thus  find  out  just  what  the 
state  of  the  DB  is  at  any  time. 

It  is  considered  very  unlikely  that  concurrent  access 
conflicts  will  be  a  problem  in  NSW.  The  nature  of  the 
database  precludes  conflicts  over  anything  other  than:  login, 
semaphore  setting,  or  entry  of  new  file  names  by  users  sharing 
name  spaces.  Nevertheless,  mechanisms  have  been  included  to 
ensure  reliable  operation.  First,  each  copy  of  the  Works 
Manager  explicitly  assigns  sequence  numbers  to  all  messages 
sent  to  other  copies  (M  numbers).  This  ensures  (1)  proper 
ordering  of  the  messages  at  the  receivers  and  (2)  detection 
and identification of lost messages, so that communications
are reliable.

To handle updates, each data item has associated with it
the  number  of  times  its  value  has  been  changed  (R  numbers). 
To  request  an  update,  the  Works  Manager  gives  its  current  R 
number  for  the  data  item  and  assigns  a  timestamp  to  the 
request.  When  the  request  arrives  at  another  host,  its  R 
number  is  checked  against  that  for  the  local  copy  of  the  data. 
If  the  incoming  R  is  smaller,  the  request  is  discarded;  if  it 
is  equal  or  larger,  the  local  pending  queues  of  all  updates 
(there  is  one  for  each  other  Works  Manager)  are  searched  until 
(for each) either: (1) an update for that item with the same
R and an earlier timestamp or (2) a message with a later
timestamp is found. If all host queues satisfy (2), the
incoming update can be applied. If any queues produce (1),
the update with the earliest timestamp is selected for
application first. It can be shown [SHAP77] that the detailed
operation  of  these  mechanisms  produces  correct  and  consistent 
results  across  the  system. 
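
Two of the simpler checks described above can be sketched in
Python. The function names are hypothetical and the sketch
deliberately omits the search of the pending queues; it shows
only how M numbers expose lost messages and how an R number
comparison identifies an update request that must be
discarded.

```python
def missing_messages(last_seen_m, incoming_m):
    """M numbers skipped between the last message seen from a sender
    and the message that has just arrived from the same sender."""
    return list(range(last_seen_m + 1, incoming_m))


def is_stale(incoming_r, local_r):
    """An update request carrying a smaller R number than the local copy
    was issued against an older value and is discarded."""
    return incoming_r < local_r


lost = missing_messages(last_seen_m=4, incoming_m=7)
in_sequence = missing_messages(last_seen_m=4, incoming_m=5)
```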

A.5  DATA  INCOMPATIBILITY

Problems  with  data  compatibility  will  come  up  whenever 
different  processors  in  a  heterogeneous  distributed  system 
attempt  to  transfer  data,  since  there  are  no  computer  industry 
standards  for  internal  computer  representation  of  data. 
Different  processors  use  different  numbers  of  bits  to  make  up 
bytes  and  words,  and  the  translation  between  corresponding 
data type formats may result in loss of precision. There are
different representations for integers (sign-magnitude, one's
complement, two's complement), for floating point numbers (bit
location of mantissa and exponent, number of bits for each,
type of exponent representation such as excess-three, and even
the base can be 2, 8 or 16) and so forth. Different
treatments of round-off or truncation may give different
results for numeric computations at different processing
sites. Even where standard character sets are adhered to
(ASCII and EBCDIC, for example), different uses of control
characters or sequences are still common and must be
translated. Composite data structures such as arrays or
complex numbers are stored differently by various processors,
and the possibility for embedded pointers complicates things
even further.
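
As a small illustration of the integer case, the following
Python sketch interprets one bit pattern under the three
representations named above. The function name, the scheme
labels, and the 8-bit width are assumptions chosen for the
example.

```python
def decode(bits, width, scheme):
    """Interpret the unsigned value `bits` as a signed `width`-bit integer."""
    sign = bits >> (width - 1)
    if scheme == "sign-magnitude":
        magnitude = bits & ((1 << (width - 1)) - 1)
        return -magnitude if sign else magnitude
    if scheme == "ones-complement":
        return bits - ((1 << width) - 1) if sign else bits
    if scheme == "twos-complement":
        return bits - (1 << width) if sign else bits
    raise ValueError("unknown scheme: " + scheme)


pattern = 0b11111110
as_sign_magnitude = decode(pattern, 8, "sign-magnitude")     # -126
as_ones_complement = decode(pattern, 8, "ones-complement")   # -1
as_twos_complement = decode(pattern, 8, "twos-complement")   # -2
```

The same eight bits yield three different values, which is why
a transfer between unlike processors needs to know the
sender's representation.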

Current  approaches  to  the  data  incompatibility  problem 
include:  homogeneous  networks  (i.e.,  ignore  the  problem); 
negotiation of options as under ARPANET's Telnet or File
Transfer Protocol; translators such as ARPA's Data
Reconfiguration Service, for users of particular application
programs; and translating to and from some pre-specified
network standard, such as the proposed Intermediate Data
Format  [LEVI77].  Notice  that  these  are  all  actually  political 
strategies  for  circumventing  the  problems,  and  are  not  really 
technical  solutions. 


Levine's discussion [LEVI77] emphasizes that descriptive
information  about  data  must  at  least  accompany  it,  and  should 
really  be  considered  part  of  it.  This  includes: 

--definition or class (e.g., character, integer,
instruction),

--internal format representation (e.g., two's
complement),

--precision.

This  issue  is  still  a  large  problem,  and  there  is  little  work 
reported  in  the  literature. 

A.6  ORGANIZATION

Research  on  the  organization  of  distributed  database 
systems  has  focused  mainly  on  where  to  physically  locate  files 
and  directories  among  the  network  nodes.  Several  cost  models 
have  been  developed  [e.g.,  CHUW73,  LEVI76,  CHAN76,  MAHM76]  and 
optimizations  were  found  to  be  very  expensive  without 
development of suitable heuristics. Other organization
problems include: file decomposition when partitioning is
desired, dynamic movement of files as user access patterns
change, and file location to achieve some system goal such as
maximum acceptable response time. Most of these are related
to the cost/performance goals set for the database system by
its designers and installers, and will be solved only on the
basis of more experience with the performance of experimental
systems.


A.7  IMPLEMENTATIONS


No full-blown general-purpose distributed database
management systems are advertised as existing today. The
purpose of this section, then, is to describe briefly some of
the planned systems and experimental implementations which
will actually lead to DDBMSs that work. It is difficult to
determine whether any of these systems will allow performance
evaluation of various implementation alternatives, such as
comparison of different concurrent access control methods.
Notice that research results from inside the computer industry
are conspicuously absent, probably for company proprietary
reasons.

A.7.1  INGRES

INGRES  is  a  database  management  system  for  relational 
databases  [STON76].  It  is  being  implemented  for  PDP-11 
computers  on  top  of  the  UNIX  operating  system.  The  extensions 
of  INGRES  being  made  to  handle  a  distributed  database  [STON77] 
assume  (1)  a  network  of  machines  homogeneous  in  that  they  all 
run  UNIX,  and  (2)  a  UNIX  to  UNIX  communication  capability. 

Distributed INGRES is a primary/back-up scheme based on
Alsberg's  n-host  resiliency  approach,  except  that  different 
data  objects  can  have  different  primary  sites.  A  particular 
goal of INGRES [STON78] is to provide a unified distributed
solution to both the concurrency control and crash recovery
aspects of the integrity problem. It is assumed that the
system exhibits a high degree of locality of reference; i.e.,
that the data required to process most transactions (about 95%
is suggested) are stored locally. Consequently, performance
is  based  specifically  on  whether  data  to  be  referenced  is 
local  or  not;  there  is  no  consideration  of  "nearby"  or  "close 
neighbor"  sites. 

Two  approaches  are  provided  for  concurrent  access  control 
and  for  crash  recovery: 

*  the "performance" oriented approach provides the fastest
possible response time, and may have consistency problems
if the system is not crash-free; and

*  the "reliability" oriented approach ensures mutual
consistency among copies, and may pay a large performance
penalty.

INGRES updates are always directed to the primary copies, but
the high performance approach will allow users to specify that
they want retrievals to be answered quickly from the local
data copy without regard for what else is occurring in the
system.

Concurrency control is handled by having a local
concurrency controller (cc) at each site, running some
appropriate algorithms. Deadlock is a particular problem
since there are multiple primary sites. The cc is responsible
for detecting any local deadlocks within its own machine, and
backing out or undoing the effects of an appropriate local
transaction. In addition, cc must report any wait-for
conditions caused by conflicting lock requests to a program
called SNOOP. SNOOP is located at some particular site just
to construct global wait-for graphs so that inter-site
deadlock can be detected and handled. In other words, SNOOP
is the primary site for the wait-for data. Such a centralized
deadlock control solution is not only simple, it is considered
by Stonebraker particularly appropriate for the high degree of
locality of reference on which INGRES is predicated. Details
of the performance and reliability algorithms are described in
[STON78].
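
The role played by a central wait-for collector can be
sketched in Python. This is a generic depth-first cycle search
over a merged wait-for graph, offered only as an illustration;
the function names and the report format are assumptions and
do not describe the actual SNOOP program.

```python
def merge_reports(reports):
    """Combine per-site lists of (waiter, holder) pairs into one global graph."""
    graph = {}
    for site_edges in reports:
        for waiter, holder in site_edges:
            graph.setdefault(waiter, set()).add(holder)
    return graph


def find_deadlock(wait_for):
    """Return a list of transactions forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = {}

    def visit(t, path):
        colour[t] = GREY
        path.append(t)
        for u in sorted(wait_for.get(t, ())):
            state = colour.get(u, WHITE)
            if state == GREY:              # back edge: u is already on the path
                return path[path.index(u):]
            if state == WHITE:
                cycle = visit(u, path)
                if cycle:
                    return cycle
        path.pop()
        colour[t] = BLACK
        return None

    for t in sorted(wait_for):
        if colour.get(t, WHITE) == WHITE:
            cycle = visit(t, [])
            if cycle:
                return cycle
    return None


# Each site sees only one edge; the cycle is visible only in the merged graph.
site_a = [("T1", "T2")]
site_b = [("T2", "T1")]
cycle = find_deadlock(merge_reports([site_a, site_b]))
```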

The organization of INGRES [STON77] is based on n-ary
relations [MAKTYb], which may be distributed across machines
in subsets of the rows of the relation. Two types of
relations are supported: regular relations, which are shared
by the network, and local relations, which are visible only to
local users on the machine where the relation is stored. Each
DDB site keeps a complete system catalog for all its locally
resident relations, and keeps a complete list of names of all
the regular relations in the network. When access is made to
a non-resident relation, the appropriate catalog entry is
requested and saved. Such information is not updated and is
simply discarded after some pre-specified amount of time.

and analysis is done ahead of time, run-time coordination uses
fast table look-up to choose protocols appropriate for
whatever transactions occur. Only transactions which might
conflict or unforeseen types of transactions will be required
to run the full synchronization protocols with their higher
overhead.

The  organization  of  SDD-1  is  based  on  a  relational  data 
model.  Each  relation  to  be  distributed  across  sites  for 
physical  storage  is  partitioned  into  rectangular  subrelations 
called  fragments  as  the  unit  of  distribution.  Redundancy  is 
possible  at  the  fragment  level;  it  is  not  expected  that  copies 
of  the  entire  database  will  be  realistic.  A  collection  of 
fragments  which  make  up  a  complete,  non-redundant  copy  of  the 
database  is  called  a  materialization.  Directories  are  treated 
in  the  same  way  as  ordinary  user  data,  except  that  directory 
information  for  directories  themselves  is  stored  at  every  data 
module.  Timestamps  are  associated  with  each  physically  stored 
copy  of  a  data  item  and  represent  the  time  at  which  the  last 
transaction  modified  that  copy  of  the  data.  Since  data  items 
are  the  smallest  pieces  of  information  that  can  be 
independently  accessed  in  the  DDB  (somewhat  in  analogy  to  the 
fields  of  a  logical  record),  considerable  extra  storage  space 
will  be  required  for  the  timestamped  items  as  compared  with 
the space for just the items.

SDD-1 uses an optimization technique to break distributed
transactions  into  a  combination  of  local  processing  on  data 
located  at  a  single  site  and  of  moving  data  among  sites  so 
that  it  can  be  processed.  The  object  is  to  minimize  the  total 
cost.  The  process  basically  consists  of  iteratively  trying  to 
improve  the  cost  of  a  simply  chosen  strategy  by  examining  each 
step  to  see  if  it  can  be  replaced  by  a  lower  cost  combination 
of  other  moves  and  local  processing.  Moving  cost  is  assumed 
proportional  to  the  amount  of  data  moved;  processing  cost 
depends  on  the  operation  and  the  size  of  the  relation. 




Appendix  B 


DETAILS  OF  THE  MODEL  ANALYSIS 


B.1  THE  BASIC  M/M/1  QUEUING  MODEL 

The  notation  M/M/1  specifies  first  that  the  interarrival 
times  are  exponentially  distributed  (M),  next  that  the 
distribution  of  service  times  is  exponential  (M),  and  finally 
that  there  is  just  a  single  server  (1)  [KLEI75].  The 
exponential  distributions  are  characterized  by  their  means;  we 
will  use  L  to  represent  the  average  arrival  rate  to  the 
system,  and  XBAR  for  the  average  service  time  (we  could  have 
used  an  average  service  rate  such  as  mu  =  1  /  XBAR,  but  XBAR 
is  more  convenient). 

For any single server system the "utilization" is defined
to  be  the  product  of  the  average  arrival  rate  and  the  average 
service  time, 

rho  =  L  •  XBAR  ; 

so  that  the  utilization  represents  the  ratio  of  the  rate  at 
which  work  enters  the  system  to  the  rate  at  which  work  is 
performed.  Utilization  is  also  interpreted  as  the  fraction  of 
the time that the server is actually busy performing work.

The measure of queuing system behavior with which we are
particularly  concerned  in  this  analysis  is  the  average  total 
time,  T,  spent  in  the  system  by  a  transaction  (both  waiting  in 
the  queue  and  being  served).  For  M/M/1,  the  average  total 
time  spent  in  the  system  is 

T  =  XBAR  /  (  1  -  rho  )  . 
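
The M/M/1 result can be evaluated directly. The Python
function below is a minimal sketch whose name and argument
names are our own choices; it computes the utilization from L
and XBAR and then the average total time T.

```python
def mm1_time_in_system(arrival_rate, mean_service_time):
    """Average total time T in an M/M/1 system: T = XBAR / (1 - rho)."""
    rho = arrival_rate * mean_service_time   # utilization, rho = L * XBAR
    if not 0 <= rho < 1:
        raise ValueError("the queue is stable only for utilization below 1")
    return mean_service_time / (1 - rho)


# L = 0.5 arrivals per unit time and XBAR = 1.0 give rho = 0.5 and T = 2.0.
T = mm1_time_in_system(arrival_rate=0.5, mean_service_time=1.0)
```

At half utilization a transaction spends, on average, twice
its service time in the system; T grows without bound as rho
approaches 1.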

B.1.1  Centralized  Management 

In  the  centralized  system,  we  have  two  types  of 
arrivals:  updates  and  retrievals.  If  we  call  the  update 
arrival  rate  from  each  terminal  LU,  and  the  retrieval  arrival 
rate from each terminal LR, then for our central system which
has n terminals per node and only N = 1 (i.e., the central)
node, the total arrival rate to the server is

L = n • N • LR + n • N • LU .

Rather than simplify the expression by substituting in the
N = 1, we will keep it in this form for easier comparability
with  the  expressions  for  distributed  systems. 

In  general,  updates  and  retrievals  may  require  different 
amounts  of  service  time.  We  will  call  the  average  service 
time  required  for  an  update  XU,  and  the  average  service  time 
required  for  a  retrieval  XR.  Then  the  average  service  time 
for  the  entire  system  (XBAR)  depends  both  on  the  individual 
times  for  retrievals  and  updates  (XR,  XU)  and  on  the  relative 
frequencies  of  occurrence: 


XBAR = (n • N • LR / L) • XR + (n • N • LU / L) • XU .

For A the ratio of update to retrieval arrivals (i.e.,
A = LU / LR), we rewrite

L = n • N • (LU + LR) = n • N • LR • (A + 1)   and

XBAR = (A • XU + XR) / (A + 1) .

This  gives  us 

rho = n • N • LR • (A • XU + XR)

and we see that

T = (A • XU + XR) / ((A + 1) • (1 - rho)) .

We  are,  of  course,  restricted  on  the  average  to  rho  <  1  in 
order  that  the  system  be  stable  and  not  bog  down. 

Each  transaction  in  the  centralized  system  must  first  be 
sent to the central node, then enter the queue, be serviced,
and finally be sent back to the terminal. The average
response time for a transaction is thus

RT = CD1 + T + CD2 ,

where CD1 is the communications delay from terminal to central
node and CD2 is the communications delay from the central node
back to the terminal. In order to focus on the issues of the
management schemes being compared, we will simplify the detail
of the communications contributions by taking CD1 = CD2 = CD,
a constant. In our first order treatment, this could be
considered as a distribution of communication times
represented by its average value.

Substituting  for  T  and  combining  the  CD  terms,  we  get 

RT = 2 • CD + (A • XU + XR) / ((A + 1) • (1 - rho)) .
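
A short Python sketch of the centralized response time
follows. The function name and the sample parameter values are
assumptions chosen only to exercise the formula; the symbols
match the definitions in this section.

```python
def centralized_response_time(n, N, LR, A, XU, XR, CD):
    """RT = 2*CD + (A*XU + XR) / ((A + 1) * (1 - rho)) for the central system."""
    rho = n * N * LR * (A * XU + XR)          # utilization of the central server
    if rho >= 1:
        raise ValueError("rho must be below 1 for a stable system")
    T = (A * XU + XR) / ((A + 1) * (1 - rho)) # average time at the central node
    return 2 * CD + T


# Illustrative values: 10 terminals, one node, one update per two retrievals.
rt = centralized_response_time(n=10, N=1, LR=0.01, A=0.5, XU=2.0, XR=1.0, CD=0.1)

# Cross-check against the basic M/M/1 form T = XBAR / (1 - L * XBAR).
L = 10 * 1 * 0.01 * (0.5 + 1)
XBAR = (0.5 * 2.0 + 1.0) / (0.5 + 1)
rt_check = 2 * 0.1 + XBAR / (1 - L * XBAR)
```

For these values rho = 0.2, and both routes give the same
response time, which confirms that the substituted form agrees
with the basic M/M/1 expression.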

B.1.2  Master/Slave  Analysis 

In  the  operation  of  a  master/slave  system,  each  slave 
node  is  responsible  for  processing  the  retrieval  transactions 
it  receives  from  its  own  locally  connected  terminals  and  for 
applying  all  of  the  system  update  transactions  against  its 
local  (complete  duplicate)  copy  of  the  database.  This  gives  a 
total  average  arrival  rate  at  a  slave  node  of 
Ls = n • LR + n • N • LU .

Using the proper proportions of each transaction type, we see
that the average service time within a slave node is

XBARs = (n • LR / Ls) • XR + (n • N • LU / Ls) • XU .

Substituting  A  =  LU  /  LR  and  simplifying,  we  get 

Ls = n • LR • (N • A + 1)   and

XBARs = (N • A • XU + XR) / (N • A + 1) .

The  master  node  is  responsible  for  all  of  the  system 
update  processing  (such  as  sequencing,  resolving  conflicts, 
application)  before  the  accepted  update  is  sent  back  out  to 
the slaves for local application to the slave database copies.
Since the master also has locally connected terminals for
which  it  handles  retrievals,  the  total  average  transaction 
arrival  rate  at  the  master  is 

Lm=n*N*LU+n*LR, 

which  we  i.iimed  iatel  y  recognize  as  the  same  as  Ls  for  the  slave 

nodes.  The  same  kinds  and  rates  of  transaction  arrivals  apply 

also  to  the  back-up  node  (which  provides  the  2-host 

resiliency)  in  the  master/slave  system.  Since  all  nodes  have 

now  been  shown  to  handle  the  same  total  average  arrival  rates 

with  the  same  average  service  times,  we  can  say  that  in  a 

master /slave  system,  for  each  node, 

L  :  n  •  LR  »  (N  *  A  1  )  and 

N  »  A  •  XU  XR 

XBAR  :  - . - . 

N  *  A  -f  1 

From L and XBAR, we get that the utilization of each node is

rho = n * LR * (N * A * XU + XR)

and that the average time spent in any one node is

T = (N * A * XU + XR) / [(N * A + 1) * (1 - rho)] .
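The per-node quantities for the master/slave scheme can be evaluated numerically. The following is a minimal sketch, not part of the original analysis: the function name and the sample parameter values are illustrative assumptions, and only the formulas for L, XBAR, rho and T are taken from the text.

```python
def master_slave_node(n, N, LR, A, XU, XR):
    """Per-node load in the basic master/slave model.

    n: terminals per node, N: number of nodes, LR: retrieval rate per
    terminal, A: update/retrieval ratio (LU / LR), XU and XR: average
    service times for updates and retrievals.
    """
    L = n * LR * (N * A + 1)                # total arrival rate at one node
    xbar = (N * A * XU + XR) / (N * A + 1)  # mix-weighted average service time
    rho = L * xbar                          # same as n * LR * (N*A*XU + XR)
    if rho >= 1:
        raise ValueError("utilization must be less than one")
    T = xbar / (1 - rho)                    # average time spent in the node
    return L, xbar, rho, T


# Illustrative values only (assumed, not from the dissertation).
L, xbar, rho, T = master_slave_node(n=10, N=4, LR=0.01, A=0.5, XU=1.0, XR=0.5)
```

With these assumed inputs the node runs at a utilization of 0.25, so the queueing term inflates the average service time by a factor of 1 / 0.75.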


The  average  response  time  throughout  the  system  must  take 
into  account  both  different  types  of  transactions.  Using  the 
notation  p(event)  to  represent  the  probability  of  that  event, 
we  have 

RT = p(update) * RTU + p(retrieval) * RTR ,
where  RTU  and  RTR  are  the  average  response  times  for  updates 
and  retrievals,  respectively.  Now  considering  the  different 




In  the  master/slave  system,  since  retrievals  are  all 
handled  locally,  they  contribute  no  network  messages  to  the 
cost  predictor.  The  contribution  of  the  updates  is  weighted 
by  their  probability  of  occurrence,  so  that  the  system  cost  in 
terms  of  the  average  number  of  messages  required  to  process 
any  transaction  is 

NM = [A/(A+1)] * [(3*N-2)/N] .

B.1.3  Synchronized  Management 

The  control  token,  as  any  other  message  or  transaction, 
requires  time  CD  to  be  passed  from  one  node  to  its  successor. 
The length of time for a circuit of N nodes, i.e., the time
between successive token arrivals at a single node, is then
N * CD. During that time, the average number of arrivals at
each node is n * N * CD * LR retrievals and n * N * CD * LU
updates. All of the updates from all of the nodes go into the
batch and the retrievals are saved locally at each node to be
added at the end of the batch processing, for a total batch
size at any node of

N * (n * N * CD * LU) + (n * N * CD * LR)
    = n * N * CD * (N * LU + LR).

The time required to process the batch sequentially is

TBX = n * N * CD * (N * LU * XU + LR * XR).

Since the time between batches is N * CD, we can define the
utilization of the server at each node to be

rho = TBX / (N * CD) ,

which is (as usual) the fraction of time the server is busy.
We are constrained, as mentioned before, to have

rho = n * (N * LU * XU + LR * XR) < 1.

The  average  response  time  for  a  transaction  will  be  made 
up  of  three  parts: 

•  waiting  for  the  control  token, 

•  waiting  for  the  predecessors  in  the  queue  to  be 
served,  and 

•  actual  service  time. 

Using our assumption of activity being uniformly distributed
across all terminals and nodes of the network, we will say
that the average wait for the control token is half of the
time for a virtual circuit, .5 * N * CD. The order in the
queue is updates from all other nodes, local updates and then
local retrievals. All local transactions wait for non-local
updates (from the other N-1 nodes) to be processed,

n * (N-1) * CD * N * A * LR * XU

(recalling that LU = A * LR). Since average service times are
constant for a given type of transaction, the uniform
distribution says the average wait for local updates will be
for half the total number and the average wait for local
retrievals will be for all updates plus half the number of
retrievals in the batch. Thus we have

RT = .5*N*CD + n*(N-1)*N*CD*A*LR*XU
     + p(update) * (.5*n*N*CD*A*LR*XU)
     + p(retrieval) * (n*N*CD*A*LR*XU + .5*n*N*CD*LR*XR) .

Putting in the probabilities and combining waits with service
times gives

RT = .5*N*CD + n*(N-1)*N*CD*A*LR*XU
     + [A/(A+1)] * (.5*n*N*CD*A*LR*XU + XU)
     + [1/(A+1)] * [n*N*CD*LR*(A*XU + .5*XR) + XR] .
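The three components of the synchronized response time (token wait, queue wait, service) can be assembled step by step. The sketch below is an illustration under assumed parameter values; the function and variable names are hypothetical, and the arithmetic follows the expression for RT given above.

```python
def synchronized_response_time(n, N, CD, LR, A, XU, XR):
    """Average response time for the basic synchronized scheme."""
    LU = A * LR
    rho = n * (N * LU * XU + LR * XR)
    if rho >= 1:
        raise ValueError("utilization must be less than one")

    token_wait = 0.5 * N * CD                         # half a virtual circuit
    remote_updates = n * (N - 1) * N * CD * LU * XU   # updates from other nodes
    local_updates = n * N * CD * LU * XU              # time to serve all local updates
    local_retrievals = n * N * CD * LR * XR           # time to serve all local retrievals

    # Wait among local work plus own service time, by transaction type.
    update_part = 0.5 * local_updates + XU
    retrieval_part = local_updates + 0.5 * local_retrievals + XR

    p_update = A / (A + 1)
    p_retrieval = 1 / (A + 1)
    return (token_wait + remote_updates
            + p_update * update_part
            + p_retrieval * retrieval_part)


# Illustrative values only (assumed, not from the dissertation).
rt = synchronized_response_time(n=10, N=4, CD=0.1, LR=0.01, A=0.5, XU=1.0, XR=0.5)
```

Naming the intermediate terms makes it easier to see which part of the delay grows with N: the token wait grows linearly, while the remote-update wait grows roughly with N squared.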

If we treat the control token as having the updates to be
circulated appended to it, the combined token/update package
can be considered as a message. For each update, then, N
messages are required to complete the circulation, and the
number of messages averaged over both updates and retrievals
gives us

NM = [A/(A+1)] * N .
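The message-cost predictors for the master/slave and synchronized schemes differ only in the factor applied to the update probability, which makes them easy to compare side by side. A minimal sketch follows; the function names are hypothetical and the sample values are assumed for illustration.

```python
def messages_master_slave(A, N):
    """Average messages per transaction, basic master/slave scheme."""
    return (A / (A + 1)) * (3 * N - 2) / N


def messages_synchronized(A, N):
    """Average messages per transaction, basic synchronized scheme."""
    return (A / (A + 1)) * N


# Illustrative values only: equal update and retrieval rates, four nodes.
ms = messages_master_slave(A=1.0, N=4)
sync = messages_synchronized(A=1.0, N=4)
```

Under these formulas the two predictors coincide at N = 1 and N = 2, since (3N - 2)/N = N there, and the synchronized scheme needs more messages per transaction for any larger N.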

B.2  EXTENSION  OF  THE  MODEL  TO  HANDLE  QR/BI  RETRIEVALS 

From Kleinrock's [KLEI76] derivations for non-preemptive
head-of-the-line priority queuing, we know that the average
delay (W0) due to a customer already in service as observed by
a new arrival is

W0 = SUM[i=1..P] (Li * XiSQBAR / 2) ,

where Li is the average arrival rate for the i-th priority
class, XiSQBAR is the second moment of the service time
distribution for the i-th priority class, and there are P
priority classes in the queue. We can simplify this
expression by taking advantage of our knowledge of the
exponential distribution where the second moment is just twice
the square of the first moment [KLEI75]:

W0 = SUM[i=1..P] Li * XiBAR**2 ,

where  now  XiBAR  is  the  first  moment  of  the  service 
distribution  for  the  i-th  priority  class,  and  that  is  just 
what  we  have  been  working  with,  the  average  service  time!  For 
the  QR-BI  model,  P  =  2;  the  QR  transactions  will  be  one 
priority  class  and  the  BI  and  update  transactions  will  be  the 
other.  So  for  QR-BI  we  will  work  with  just 
W0 = L1 * X1BAR**2 + L2 * X2BAR**2 .

For  each  of  the  management  schemes,  then,  we  will  have  to 
discover  the  average  arrival  rates  and  average  service  times 
by  priority  class. 

Each new arrival to the priority queuing system will have
to wait, not only for any customer in service when he arrives,
but also for any higher priority customers who are already in
the queue or who arrive before our particular arrival gets
into service. Kleinrock shows [KLEI76] that these waits sum
together to give the average wait time in queue for customers
of each priority class,

Wp = W0 + SUM[i=p..P] XiBAR*Li*Wi + SUM[i=p+1..P] XiBAR*Li*Wp

for  p  =  1,  2,...,P  .  Notice  that  this  set  of  equations  is 
entirely  based  on  just  the  things  that  we  have  assumed  across 
the  database  management  system  models. 

The  solutions  to  the  wait  time  equations  are  more 
conveniently  written  in  terms  of  a  quantity  sigma,  which 
Kleinrock  defines  for  each  priority  class  p  as: 

sigma(p) = SUM[i=p..P] XiBAR*Li = SUM[i=p..P] rho(i) ,

with sigma(P+1) = 0, since that sum is empty.

Then the solutions to the waiting times become

Wp = W0 / { [1-sigma(p)] * [1-sigma(p+1)] }

for p = 1, 2, ..., P .
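The wait-time solution can be computed for any number of priority classes by accumulating sigma from the highest class downward. The sketch below assumes exponential service in every class (so that W0 reduces to the sum of Li * XiBAR**2, as in the text); the function name and the list-based interface are hypothetical.

```python
def priority_waits(rates, xbars):
    """Average queue waits W1..WP for head-of-the-line priority queuing.

    rates[i] and xbars[i] are the arrival rate and average service time
    of class i+1; a larger class index means higher priority.
    """
    P = len(rates)
    W0 = sum(l * x ** 2 for l, x in zip(rates, xbars))

    # sigma[p] = rho(p) + rho(p+1) + ... + rho(P), with sigma[P+1] = 0.
    sigma = [0.0] * (P + 2)
    for p in range(P, 0, -1):
        sigma[p] = sigma[p + 1] + rates[p - 1] * xbars[p - 1]
    if sigma[1] >= 1:
        raise ValueError("total utilization must be less than one")

    return [W0 / ((1 - sigma[p]) * (1 - sigma[p + 1])) for p in range(1, P + 1)]


# Illustrative two-class case: class 1 is low priority, class 2 is high priority.
W1, W2 = priority_waits(rates=[0.2, 0.1], xbars=[1.0, 0.5])
```

With a single class the expression collapses to rho * XBAR / (1 - rho), the familiar queue wait for a single exponential server, which is a convenient sanity check.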


We  restrict  the  solution  to  the  two  priority  classes  for 
the  QR-BI  modeling: 

p  =  1  for  updates  and  BI  retrievals  and 
p  =  2  for  QR  retrievals 

so that the variables in the equations will be L1 and X1BAR,
L2 and X2BAR, sigma1 = sigma(1) and sigma2 = sigma(2), W1 and
W2. The average wait due to some transaction already in
service becomes

W0 = L1 * (X1BAR)**2 + L2 * (X2BAR)**2 ,

and the average wait times (in queue) will be

W1 = W0 / [(1-sigma1) * (1-sigma2)]  and
W2 = W0 / (1-sigma2) .

Since the average total time (in queue and in service) is the
sum of the average wait and average service times, we get

TU = W1 + XUBAR for updates,
TBI = W1 + XRBAR for BI retrievals, and
TQR = W2 + XRBAR for QR retrievals.

These  processing  times  will  be  combined  with  various 
communications  delays  according  to  the  management  schemes  to 
yield  average  system  response  times.  As  usual,  solutions  will 
be  constrained  by  utilizations  less  than  one: 

rho = L1 * X1BAR + L2 * X2BAR < 1 .


B.2.1  Centralized  Management 

In  the  centralized  database  management  scheme,  all 
transactions  are  handled  at  the  central  node.  So  for  n 
terminals,  all  communicating  with  the  (N=1,  single)  central 
node,  the  arrival  rates  within  priority  classes  are 
L1 = n * N * (LU + LBI)  and
L2 = n * N * LQR ,

where LU is the update arrival rate per terminal (as before),
LBI is the arrival rate per terminal for BI retrievals, and
LQR is the arrival rate per terminal for QR retrievals.

Since we have grouped the BI retrievals in the same
priority class as the updates and the two may have different
average service times (XR and XU), we must use a mean value
for the class average:

X1BAR = [LU / (LU+LBI)] * XU + [LBI / (LU+LBI)] * XR .


In  contrast,  the  high  priority  class  has  only  QR  retrievals  to 
service,  and  we  have  simply 
X2BAR  =  XR  . 

Noticing that the total retrieval arrival rate per
terminal is LR = LQR + LBI, we again define A = LU / LR as the
ratio of update to retrieval transactions. We choose for
another parameter the ratio of the QR to the BI requests,
B = LQR / LBI. Rewriting in these terms the basic quantities
so far, we get

L1 = n * N * LBI * [ A*(B+1) + 1 ]

X1BAR = [A * (B+1) * XU + XR] / [A * (B+1) + 1]

L2 = n * N * B * LBI

X2BAR = XR .

From  these  expressions  we  calculate 

W0 = n*N*LBI * ([A*(B+1)*XU + XR]**2) / [A*(B+1) + 1] + n*N*B*LBI * (XR**2) ,

sigma1 = n * N * LBI * [ A*(B+1)*XU + (B+1)*XR ] ,  and
sigma2 = n * N * B * LBI * XR .

The  wait  times  for  each  priority  class  are  thus 
W1 = W0 / [(1-sigma1)*(1-sigma2)]  and
W2 = W0 / (1-sigma2) .

We  combine  the  waiting  and  service  times  to  get  the  total 
time  required  in  the  system  for  the  different  transaction 
types:

TU  =  W1  +  XU  for  updates, 

TBI  =  W1  +  XR  for  BI  retrievals,  and 
TQR  =  W2  +  XR  for  the  QR  retrievals. 

The  utilization  restriction  is  that 

1 > rho = rho(1) + rho(2) = L1*X1BAR + L2*X2BAR .

As  in  the  basic  case  of  centralized  management,  the 
average  response  time  for  the  entire  system  is  made  up  of  both 
the  communication  delays  and  the  average  time  spent  in  the 
central  node 

RT = 2*CD + [A/(A+1)] * TU + {1/[(A+1)*(B+1)]} * TBI + {B/[(A+1)*(B+1)]} * TQR
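The centralized QR-BI response time can be evaluated end to end from the per-terminal rates. The following is a sketch under assumed inputs: the function name, the dictionary it returns, and the sample values are illustrative, while the formulas for the class rates, waits, and weights follow the derivation above.

```python
def centralized_qrbi(n, N, LBI, A, B, XU, XR, CD):
    """Average response time for centralized management with QR/BI retrievals."""
    LU = A * (B + 1) * LBI           # from A = LU / LR and LR = LQR + LBI
    LQR = B * LBI

    L1 = n * N * (LU + LBI)          # low priority: updates and BI retrievals
    x1 = (LU * XU + LBI * XR) / (LU + LBI)
    L2 = n * N * LQR                 # high priority: QR retrievals
    x2 = XR

    W0 = L1 * x1 ** 2 + L2 * x2 ** 2
    sigma2 = L2 * x2
    sigma1 = L1 * x1 + sigma2
    if sigma1 >= 1:
        raise ValueError("utilization must be less than one")

    W1 = W0 / ((1 - sigma1) * (1 - sigma2))
    W2 = W0 / (1 - sigma2)

    p_update = A / (A + 1)
    p_bi = 1 / ((A + 1) * (B + 1))
    p_qr = B / ((A + 1) * (B + 1))
    RT = 2 * CD + p_update * (W1 + XU) + p_bi * (W1 + XR) + p_qr * (W2 + XR)
    return {"RT": RT, "W1": W1, "W2": W2, "sigma1": sigma1, "sigma2": sigma2}


# Illustrative values only (assumed, not from the dissertation).
result = centralized_qrbi(n=10, N=1, LBI=0.01, A=0.5, B=1.0, XU=1.0, XR=0.5, CD=0.1)
```

The three weights always sum to one, and W2 never exceeds W1, which reflects the head-of-the-line advantage given to QR retrievals.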

The  only  messages  crossing  the  network  for  the  QR-BI 
centralized  system  are  the  same  communications  between  the 
terminals  and  the  central  node  as  in  the  basic  case,  so 
NM  =  2 

for our cost prediction.

B.2.2  Master/Slave  Management  Analysis

Let us begin by considering the master node, since under
the QR-BI scheme the master picks up the additional work of
answering all the BI retrievals for the network. This gives a
low-priority arrival rate of

L1 = n*N*(LU + LBI) = n*N*LBI*[A*(B+1) + 1]

and an average service time of

X1BAR = [A*(B+1)*XU + XR] / [A*(B+1) + 1] .

The high-priority arrival rate at the master is just its share
of the QR requests as input from its locally connected
terminals

L2 = n * LQR = n * B * LBI ,

and the average service time is

X2BAR = XR .

We now have enough information for

W0m = n*N*LBI * ([A*(B+1)*XU + XR]**2) / [A*(B+1) + 1] + n*B*LBI * (XR**2) ,

sigma1m = n * LBI * [ N*A*(B+1)*XU + (N+B)*XR ] ,  and

sigma2m = n * B * LBI * XR .

As usual, the total time spent in the system by each
transaction type is the sum of its wait and service times:

TUm = W1m + XU for updates,
TBIm = W1m + XR for BIs, and
TQRm = W2m + XR for QRs.

The slave and back-up nodes have exactly the same arrivals for
QR-BI as they did for the basic case. The updates are
processed in the low-priority class, for which

L1s = n*N*LU = n*N*A*(B+1)*LBI  and
X1BARs = XU .

The behavior of the high-priority class depends only on the QR
retrievals, so

L2s = n * LQR = n * B * LBI  and
X2BARs = XR .

Calculating  our  now-familiar  quantities 

W0s = n*N*LBI*A*(B+1)*(XU**2) + n*B*LBI*(XR**2) ,
sigma1s = n*LBI * [ N*A*(B+1)*XU + B*XR ] ,  and
sigma2s = n*B*LBI*XR ,

we  are  prepared  to  get  the  average  total  time  in  the  system  by 
transaction  type: 

TUs = W1s + XU for updates at the slaves,
TQRs = W2s + XR for QRs at the slaves,
TUb = TUs for updates at the back-up, and
TQRb = TQRs for QRs at the back-up.


As  in  the  basic  case,  updates  are  handled  by  the  master 
and  back-up  (with  2-host  resiliency), 

RTU = [(3N-2)/N] * CD + TUm + TUb .

All BIs which arrive at nodes other than the master are sent
there for processing, while the master's BIs are local. The
average response over the system for a BI retrieval is thus

RTBI = [(N-1)/N] * (2*CD + TBIm) + (1/N) * TBIm .


QR  retrievals  are  all  processed  at  their  local  nodes,  so  that 
on  the  average 

RTQR = (1/N) * TQRm + (1/N) * TQRb + [(N-2)/N] * TQRs .

We can combine transaction response times with the frequencies
of occurrence for the average system response

RT = [A/(A+1)] * { [(3N-2)/N]*CD + TUm + TUb }
     + {1/[(A+1)*(B+1)]} * { [(N-1)/N]*2*CD + TBIm }
     + {B/[(A+1)*(B+1)]} * { TQRm/N + TQRb/N + TQRs*(N-2)/N } .


The cost prediction for the QR-BI model of master/slave
management simply adds the contribution for the BIs to the
update contribution of the basic case:

NM = [A/(A+1)] * [(3N-2)/N] + {1/[(A+1)*(B+1)]} * [(N-1)/N] * 2 .
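The master/slave QR-BI predictors involve two differently loaded kinds of node, so a numerical sketch helps keep the pieces apart. The code below is an illustration only: the helper `hol_waits`, the function name, and the sample values are assumptions, and N is taken to be at least 2 so that a master and a back-up both exist.

```python
def hol_waits(L1, x1, L2, x2):
    """Two-class head-of-the-line waits: (W1 low priority, W2 high priority)."""
    W0 = L1 * x1 ** 2 + L2 * x2 ** 2
    sigma2 = L2 * x2
    sigma1 = L1 * x1 + sigma2
    if sigma1 >= 1:
        raise ValueError("utilization must be less than one")
    return W0 / ((1 - sigma1) * (1 - sigma2)), W0 / (1 - sigma2)


def master_slave_qrbi(n, N, LBI, A, B, XU, XR, CD):
    """Return (RT, NM) for master/slave management with QR/BI retrievals."""
    LU = A * (B + 1) * LBI
    LQR = B * LBI

    # Master: all updates and all BI retrievals, plus its local QRs.
    x1m = (LU * XU + LBI * XR) / (LU + LBI)
    W1m, W2m = hol_waits(n * N * (LU + LBI), x1m, n * LQR, XR)
    # Slaves and back-up: all updates, plus their local QRs.
    W1s, W2s = hol_waits(n * N * LU, XU, n * LQR, XR)

    TUm, TBIm, TQRm = W1m + XU, W1m + XR, W2m + XR
    TUb = W1s + XU
    TQRb = TQRs = W2s + XR

    RTU = (3 * N - 2) / N * CD + TUm + TUb
    RTBI = (N - 1) / N * 2 * CD + TBIm
    RTQR = (TQRm + TQRb + (N - 2) * TQRs) / N

    p_update = A / (A + 1)
    p_bi = 1 / ((A + 1) * (B + 1))
    p_qr = B / ((A + 1) * (B + 1))
    RT = p_update * RTU + p_bi * RTBI + p_qr * RTQR
    NM = p_update * (3 * N - 2) / N + p_bi * (N - 1) / N * 2
    return RT, NM


# Illustrative values only (assumed, not from the dissertation).
RT, NM = master_slave_qrbi(n=10, N=4, LBI=0.005, A=0.5, B=1.0, XU=1.0, XR=0.5, CD=0.1)
```

One consequence of the formulas is that the coefficient of CD in RT equals NM, so in this model each network message contributes exactly one communications delay to the average response.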


B.2.3  Analysis  of  Synchronized  Management 

Unlike the previous QR-BI models, where we have been able
to capitalize directly on Kleinrock's presentation of
head-of-the-line priority queuing, the batch nature of sending
transactions around the virtual circuit with the control token
destroys the premise of the wait time analysis. The arrivals
are no longer Poisson to each priority class. We will attempt
to cover the situation by breaking our consideration of the
system activity into two parts: batch processing, during
which QR retrievals are given non-preemptive priority over
batch items; and processing between batches, when QRs arrive
and wait for service only with any previous QRs.

In  general,  the  k-th  local  update  in  the  batch  must  wait 
while  all  the  updates  in  the  batch  from  the  other  nodes  are 
processed,  while  its  k-1  local  predecessor  updates  are  served, 
and  for  any  local  QR  transactions  which  arrive  before  the  k-th 
update  is  itself  accepted  for  processing.  We  can  express  this 
by 

WU(k) = (N-1)*(n*N*CD*LU)*XU + (k-1)*XU + LQR*WU(k)*XR ,

where WU(k) stands for the wait time experienced by the k-th
local update, and where the last term is the QR arrival rate
times the interval of interest times the required service
time. Solving this equation for the wait time itself, we find

WU(k) = [ (N-1)*n*N*CD*LU*XU + (k-1)*XU ] / (1 - LQR*XR) .
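Because WU(k) appears on both sides of the defining relation, it can be reassuring to confirm numerically that the closed form satisfies it. The sketch below does that for assumed parameter values; the function name is hypothetical, and the QR term uses LQR exactly as written in the equation above.

```python
def update_wait(k, n, N, CD, LU, LQR, XU, XR):
    """Wait time of the k-th local update in the batch (closed form)."""
    if LQR * XR >= 1:
        raise ValueError("LQR * XR must be less than one")
    fixed_part = (N - 1) * (n * N * CD * LU) * XU + (k - 1) * XU
    return fixed_part / (1 - LQR * XR)


# Illustrative values only (assumed, not from the dissertation).
params = dict(n=10, N=4, CD=0.1, LU=0.005, LQR=0.01, XU=1.0, XR=0.5)
w = update_wait(k=3, **params)

# Substituting w back into the defining relation should reproduce w.
rhs = ((params["N"] - 1) * (params["n"] * params["N"] * params["CD"] * params["LU"]) * params["XU"]
       + (3 - 1) * params["XU"]
       + params["LQR"] * w * params["XR"])
```

The denominator 1 - LQR*XR is what accounts for QRs that keep arriving while the update is waiting: the longer the wait, the more QRs slip in ahead of it.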

All of the updates are located together in the batch, so that,
on the average, the uniform distribution says the wait time
will be for half of the total number of local updates actually
in the batch:

WU = [ (N-1)*n*N*CD*LU*XU + (1/2)*(n*N*CD*LU*XU - 1) ] / (1 - LQR*XR) .

Similarly, the local BI retrievals in the batch must wait
for all of the updates to be served, for half (on the average)
of the BIs themselves, and for any arriving QRs. In a manner
directly analogous to the development of WU, then, we get

WBI = [ N*n*N*CD*LU*XU + (1/2)*n*N*CD*LBI*XR ] / (1 - LQR*XR) .


To  express  the  average  wait  time  for  the  QR  retrievals, 
we  must  take  into  account  the  difference  between  arrivals 
which  arise  during  and  between  batch  processing.  Let  us 
define,  as  in  the  basic  case,  the  average  utilization  of  the 
server  due  to  the  batch  processing: 

rho(b) = TBX / (N*CD),

where TBX is the time to process the items in the batch
sequentially and N*CD is the total time required for the
control token to complete a circuit of the virtual ring. We
can see that

TBX = N*n*N*CD*LU*XU + n*N*CD*LBI*XR ,

giving us

rho(b) = n*N*LU*XU + n*LBI*XR
       = n*LBI * [ N*A*(B+1)*XU + XR ] .

So, on the average, QR transactions will arrive at the service
queue to find batch processing in progress a fraction rho(b)
of the time. Since we have chosen a non-preemptive scheme,
the QRs will have to wait an extra amount of time for the
batch item in progress to be finished. W0, the wait due to
some customer being in service when our new QR arrives, is
just

W0 = rho(b) * (average time left in service) .

From  the  memoryless  property  of  the  exponential  distribution 
[KLEI7b],  we  get  that  the  average  time  remaining  in  service  is 
just  the  average  service  time  for  the  batch  items, 
(n*N*N*CD*LU*XU + n*N*CD*LBI*XR) / (n*N*N*CD*LU + n*N*CD*LBI) .

Substituting this back gives us

W0 = rho(b) * (N*LU*XU + LBI*XR) / (N*LU + LBI) .

In addition to W0, the QR transactions may have to wait for
any preceding QR arrivals to be processed, giving us a total
average time in queue of

WQR = W0 + n*LQR*(XR**2) / (1 - n*LQR*XR) .
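The QR wait in the synchronized scheme combines a residual-service term from batch processing with an ordinary queueing term among the QRs themselves. The following sketch evaluates both under assumed inputs; the function name and the sample values are illustrative, not taken from the dissertation.

```python
def qr_wait_synchronized(n, N, LBI, A, B, XU, XR):
    """Average queue wait WQR for QR retrievals under synchronized management."""
    LU = A * (B + 1) * LBI
    LQR = B * LBI

    rho_b = n * N * LU * XU + n * LBI * XR         # fraction of time spent on batches
    rho_qr = n * LQR * XR                          # utilization due to local QRs
    if rho_b >= 1 or rho_qr >= 1:
        raise ValueError("utilizations must be less than one")

    # Average service time of a batch item (updates from all N nodes, local BIs).
    batch_item_time = (N * LU * XU + LBI * XR) / (N * LU + LBI)
    W0 = rho_b * batch_item_time                   # wait for a batch item in service
    return W0 + n * LQR * XR ** 2 / (1 - rho_qr)   # plus wait for earlier QRs


# Illustrative values only (assumed).
wqr = qr_wait_synchronized(n=10, N=4, LBI=0.01, A=0.5, B=1.0, XU=1.0, XR=0.5)
```

For these assumed values the batch term dominates: batches occupy 45 percent of the server's time, while QRs alone account for only 5 percent.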

In looking to describe the response time, we realize that
the BIs and updates must wait for the control token to arrive
before batch processing can even take place. Since
communication delays between successors in the virtual ring
have been considered constant, the uniform distribution tells
us the delay waiting for the token will be half a virtual
circuit on the average. This makes transaction response times

RTU = (N/2) * CD + WU + XU ,
RTBI = (N/2) * CD + WBI + XR ,  and
RTQR = WQR + XR .


Weighting these individual times by the probability of each
transaction type gives an average system response time of

RT = [A/(A+1)] * RTU + {1/[(A+1)*(B+1)]} * RTBI + {B/[(A+1)*(B+1)]} * RTQR

As  in  the  basic  case  for  synchronized  management,  it  is 
only  the  updates  which  contribute  network  message  traffic  to 
the  cost  prediction, 

NM = [A/(A+1)] * N .

B.2.4  Analysis  of  Delayed  Synchronization  Management 

Under the delayed synchronization scheme, the
low-priority arrival rate is

L1 = n * LBI * [ A*(B+1) + N ] ,

since each node handles its own updates, while the BI
retrievals complete a circuit of a virtual ring in order to find
the most recent copies of data. The corresponding average
service time for the class is

X1BAR = [A*(B+1)*XU + N*XR] / [A*(B+1) + N] .

If  we  add  up  the  QR  arrivals  which  can  be  processed  at  a  given 
node,  including  local  requests  being  processed  locally  and 
requests  transferred  from  other  nodes,  the  aggregate  arrival 
rate  is 

L2 = n * LQR = n * B * LBI

and the average service time for the QR priority class is

X2BAR = XR .

From these quantities, we calculate

W0 = n*LBI * ([A*(B+1)*XU + N*XR]**2) / [A*(B+1) + N] + n*B*LBI * (XR**2)

along with

sigma1 = n * LBI * [ A*(B+1)*XU + (N+B)*XR ] ,  and
sigma2 = n * B * LBI * XR .

Using  the  now  familiar  expressions  for  total  time  by 
transaction  type, 

TU = W1 + XU ,
TBI = W1 + XR ,  and
TQR = W2 + XR,

and  the  frequencies  of  occurrence,  we  get 


RT = [A/(A+1)] * TU + {1/[(A+1)*(B+1)]} * N * [CD + TBI]
     + {B/[(A+1)*(B+1)]} * ( [p(local)]*TQR + [1-p(local)]*[2*CD + TQR] )

in  a  manner  analogous  to  the  basic  case. 

Network messages are required to send the BIs around the
virtual circuit and to pass QRs which cannot be answered
locally to the nearest neighbor. Averaging over all the types
of transactions, then, we get a cost prediction of

NM = {1/[(A+1)*(B+1)]} * N + {B/[(A+1)*(B+1)]} * [1-p(local)] * 2

with terms in powers of p(local) neglected from our locality
of reference assumption.
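The delayed synchronization predictors can be evaluated the same way as the other schemes, with p(local) supplied as an extra input. This is a sketch under assumed values: the function name, the returned dictionary, and the choice p_local = 0.9 are illustrative only.

```python
def delayed_sync_qrbi(n, N, LBI, A, B, XU, XR, CD, p_local):
    """Response time and message cost for delayed synchronization with QR/BI."""
    LU = A * (B + 1) * LBI

    L1 = n * (LU + N * LBI)              # local updates plus BIs circulating the ring
    x1 = (LU * XU + N * LBI * XR) / (LU + N * LBI)
    L2 = n * B * LBI                     # QR retrievals
    W0 = L1 * x1 ** 2 + L2 * XR ** 2
    sigma2 = L2 * XR
    sigma1 = L1 * x1 + sigma2
    if sigma1 >= 1:
        raise ValueError("utilization must be less than one")

    W1 = W0 / ((1 - sigma1) * (1 - sigma2))
    W2 = W0 / (1 - sigma2)
    TU, TBI, TQR = W1 + XU, W1 + XR, W2 + XR

    p_update = A / (A + 1)
    p_bi = 1 / ((A + 1) * (B + 1))
    p_qr = B / ((A + 1) * (B + 1))
    RT = (p_update * TU
          + p_bi * N * (CD + TBI)
          + p_qr * (p_local * TQR + (1 - p_local) * (2 * CD + TQR)))
    NM = p_bi * N + p_qr * (1 - p_local) * 2
    return {"RT": RT, "NM": NM, "sigma1": sigma1}


# Illustrative values only (assumed).
result = delayed_sync_qrbi(n=10, N=4, LBI=0.005, A=0.5, B=1.0,
                           XU=1.0, XR=0.5, CD=0.1, p_local=0.9)
```

Updates contribute nothing to NM in this scheme, so the message cost is driven by the share of BI retrievals and by how often a QR cannot be answered locally.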